I noticed that as well. When the classifier goes wrong, it goes badly wrong, and using different thresholds wouldn't help. Increasing the number of discriminators doesn't seem to help either. Too bad, because otherwise you could flag those messages for human classification.
I think it's worse than just that: suppose any scheme says "OK, this is spam, with probability 0.9995". If it's reporting accurate probabilities, then another way to read that claim is "On average, one time in 2000 this message actually isn't spam". In real life we have to accept that there's no scheme with a 0% false positive rate -- not even human review -- short of the scheme that never calls anything spam. Since deciding on the largest acceptable false positive rate is far more a social than a technical issue, a group of nerds will do anything rather than face it <wink>.
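The arithmetic behind that reading can be sketched in a few lines (my illustration, not from the thread; the function name is made up). It assumes the classifier's reported probabilities are well calibrated, i.e. of all messages scored p, a fraction 1-p really are ham:

```python
def expected_one_in_n(p: float) -> float:
    """Given a calibrated spam probability p, return N such that
    roughly one in N messages scored p is actually ham."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)

# A score of 0.9995 means about one ham message per 2000 so labeled.
print(expected_one_in_n(0.9995))
```

The point stands regardless of the scheme: any nonzero residual probability translates into a concrete false-positive frequency once enough mail flows through.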