We have a client that is specifically interested in improving their answering machine detection.
This is often accomplished by testing and tweaking.
1: Get a list of known answering machine numbers, ones you can call and realistically expect an answering machine to pick up.
2: Set up a test campaign with forced recording from the moment of answer.
3: Call 100 (or 1,000) of them.
4: Have an employee listen to all of the recordings (which is why 100 rather than 1,000, but the more you test, the more accurate your statistics).
5: Repeat with numbers you expect humans to answer, in similar quantities.
6: Check the recordings again.
7: Compile your stats: how many machines were marked "human" and passed to an agent, and how many humans were marked "AA" and filtered out (see the tally sketch below). Those are your two necessary stats. Note, however, that having a human marked as a machine is a very bad thing, whereas having an answering machine sent to an agent is not really horrible.
So the goal is to have as close to 100% as possible for human detection while filtering as many machines as possible.
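Step 7 is really just a tally over the recordings you reviewed by hand. Here is a minimal sketch of that tally, assuming you log each reviewed call as a (what-the-reviewer-heard, what-the-dialer-marked) pair; the "human"/"machine" labels and the sample data are made up for illustration:

```python
# Hypothetical tally of manually reviewed AMD test calls.
# Each entry: (what the reviewer heard, what the dialer marked the call as).
reviewed_calls = [
    ("machine", "machine"),
    ("machine", "human"),    # machine leaked through to an agent -- annoying, not fatal
    ("human", "human"),
    ("human", "machine"),    # human filtered out as a machine -- the costly mistake
    # ... one tuple per recording your employee listened to
]

machines = [c for c in reviewed_calls if c[0] == "machine"]
humans = [c for c in reviewed_calls if c[0] == "human"]

machines_sent_to_agent = sum(1 for heard, marked in machines if marked == "human")
humans_marked_machine = sum(1 for heard, marked in humans if marked == "machine")

print(f"Machine detection: {100 * (1 - machines_sent_to_agent / len(machines)):.1f}% "
      f"({machines_sent_to_agent} machines passed to agents)")
print(f"Human detection: {100 * (1 - humans_marked_machine / len(humans)):.1f}% "
      f"({humans_marked_machine} humans marked as machines)")
```

The two printed percentages are exactly the numbers the goal above is stated in: human detection as close to 100% as possible, machine detection as high as you can get it without sacrificing the first.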
And now the tweaking begins. We last tested this scenario for a client about two (maybe three) years ago and found it to be Very Good. It was actually near 100% for human detection and just below 95% for machine detection; the overall average was over 95%. The client said "wow" and just used it as it was; we never even had to tweak it.
But (as with most clients) those who pay for a statistical, measured analysis of this ALSO pay attention to the metrics for "with AMD" versus "without AMD". They found their response rate "with AMD" sucked by comparison to "without AMD", completely abandoned the module, and never looked back.
I attribute the "suckiness" to the extra 2 seconds of silence the AMD module requires for detection. Added to the 1-1.5 seconds already needed to transfer a call to an agent, that delay becomes noticeable to the callee and ... quite frankly pisses people off, to the point that (IMHO) it explains the statistical suckiness quite easily.
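To put rough numbers on that, here is the dead-air budget from the callee's point of view, using only the figures quoted above (ballpark values, not measurements from any particular system):

```python
# Rough post-answer dead air the callee sits through, using the figures above.
amd_analysis = 2.0                       # extra seconds of silence the AMD module listens for
transfer_low, transfer_high = 1.0, 1.5   # seconds already needed to reach an agent

print(f"Without AMD: ~{transfer_low}-{transfer_high}s of silence after 'Hello?'")
print(f"With AMD:    ~{transfer_low + amd_analysis}-{transfer_high + amd_analysis}s")
```

Three-plus seconds of silence after "Hello?" is plenty of time for people to get annoyed or hang up, which lines up with the response-rate drop described above.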
Some rooms still use it. Just not as many as used to "back in the day".