[Tim, last week]
What's an acceptable false positive rate?
Speaking as one of the people who reviews suspected spam for python.org and rescues false positives, I would say that the more relevant figure is: how much suspected spam do I have to review every morning? Fewer than 10 messages would be peachy; right now it's around 5-20 messages per day.
I must be missing something. I would *hope* that you review *all* messages claimed to be spam, in which case the number of msgs to be reviewed would, in a perfectly accurate system, be equal to the number of spams received.
Good lord, certainly not! Remember that Exim rejects a couple hundred messages a day that never get near SpamAssassin -- that's mostly Chinese/Korean junk that's rejected on the basis of 8-bit chars or banned charsets in the headers. Then, probably 50-75% of what SA gets its hands on scores >= 10.0, so it too is rejected at SMTP time. Only messages that score < 10 are accepted, and those that score >= 5.0 are set aside in /var/mail/spam for review. That's 10-30 messages/day.
(I do occasionally scan Exim's reject log on mail.python.org to see what's getting rejected today -- Exim kindly logs the full headers of every message that is rejected after the DATA command. I usually make it to about 11am of a given day's logfile before my eyes glaze over from the endless stream of spam and viruses.)
Note that we *used* to accept messages before passing them to SpamAssassin, so never rejected anything on the basis of its SA score. Back then, we saved and reviewed probably 50-70 messages/day. Very, very, very few (if any) false positives scored >= 10.0, which is why that's the threshold for SMTP-time rejection.
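A minimal sketch of the score thresholds described above, assuming only what is stated here: SA scores >= 10.0 are rejected at SMTP time, scores from 5.0 up to (but not including) 10.0 are set aside for review, and anything lower is accepted. The names `triage`, `REJECT_THRESHOLD`, and `REVIEW_THRESHOLD` are made up for illustration; this is not the actual Exim or SpamAssassin configuration on mail.python.org.

```python
# Hypothetical names; the thresholds are the ones quoted in this message.
REJECT_THRESHOLD = 10.0  # score >= 10.0: rejected at SMTP time
REVIEW_THRESHOLD = 5.0   # 5.0 <= score < 10.0: set aside for manual review


def triage(score: float) -> str:
    """Map a SpamAssassin score to one of "reject", "review", or "accept"."""
    if score >= REJECT_THRESHOLD:
        return "reject"
    if score >= REVIEW_THRESHOLD:
        return "review"
    return "accept"


if __name__ == "__main__":
    for s in (12.3, 7.5, 1.2):
        print(s, triage(s))
```

Only the "review" bucket costs anyone time each morning, which is why its size is the figure I care about.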
OTOH, the false positive rate doesn't have anything to do with the number of spams received; it has to do with the number of non-spams received.
Err, yeah, good point. I make a point of talking about "suspected spam", which is any message that scores between 5.0 and 10.0. IMHO, the true nature of those messages can only be determined by manual inspection.
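To make Tim's point concrete, here is a small sketch of the usual definition: the false positive rate is the number of non-spam messages wrongly flagged, divided by the total number of non-spam messages. The spam volume does not appear anywhere in the formula. The function name and the example counts are invented for illustration and are not python.org figures.

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Fraction of non-spam (ham) messages that were wrongly flagged as spam.

    false_positives: ham messages flagged as spam
    true_negatives:  ham messages correctly passed through
    """
    ham_total = false_positives + true_negatives
    if ham_total == 0:
        raise ValueError("no non-spam messages; the rate is undefined")
    return false_positives / ham_total


if __name__ == "__main__":
    # Made-up counts: 2 of 1000 legitimate messages were flagged.
    print(false_positive_rate(2, 998))
```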
Maybe you don't want this kind of approach at all. The classifier doesn't have "gray areas" in practice: it tends to give probabilities near 1, or near 0, and there's very little in between -- a msg either has a preponderance of spam indicators, or a preponderance of non-spam indicators.
That's a great improvement over SpamAssassin then: with SA, the grey area (IMHO) is scores from 3 to 10... which is why several python.org lists now have a little bit of Mailman configuration magic that makes MM set aside messages with an SA score >= 3 for list admin review. (It's probably worth getting the list admin to do a bit more work in order to avoid sending low-scoring spam to the list.)
However, as long as "very little" != "nothing", we still need to worry a bit about that grey area. What do you think we should do with a message whose spam probability is between (say) 0.1 and 0.9? Send it on, reject it, or set it aside? Just how many messages fall in that grey area anyways?
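The last question is easy to measure once there is a list of classifier outputs to look at. A rough sketch, assuming the probabilities are available as plain floats and treating the grey area as strictly between the two cutoffs (0.1 and 0.9 are just the "say" values from above; the function name and sample data are invented):

```python
from typing import Iterable, Tuple


def grey_area_stats(
    probabilities: Iterable[float], low: float = 0.1, high: float = 0.9
) -> Tuple[int, float]:
    """Return (count, fraction) of messages with low < probability < high."""
    probs = list(probabilities)
    if not probs:
        return 0, 0.0
    grey = sum(1 for p in probs if low < p < high)
    return grey, grey / len(probs)


if __name__ == "__main__":
    # Invented sample outputs, not real classifier results.
    sample = [0.01, 0.02, 0.5, 0.97, 0.99, 0.001, 0.85, 1.0]
    count, fraction = grey_area_stats(sample)
    print(count, fraction)  # 2 0.25
```

What to do with the messages that land in that band is a separate policy decision; this only counts them.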