[Python-Dev] The first trustworthy <wink> GBayes results
Greg Ward
greg@python.org
Tue, 3 Sep 2002 09:41:12 -0400
[Tim, last week]
> What's an acceptable false positive rate?
[my response]
> Speaking as one of the people who reviews suspected spam for python.org
> and rescues false positives, I would say that the more relevant figure
> is: how much suspected spam do I have to review every morning? < 10
> messages would be peachy; right now it's around 5-20 messages per day.
[Tim again]
> I must be missing something. I would *hope* that you review *all* messages
> claimed to be spam, in which case the number of msgs to be reviewed would,
> in a perfectly accurate system, be equal to the number of spams received.
Good lord, certainly not! Remember that Exim rejects a couple hundred
messages a day that never get near SpamAssassin -- that's mostly
Chinese/Korean junk that's rejected on the basis of 8-bit chars or
banned charsets in the headers. Then, probably 50-75% of what SA gets
its hands on scores >= 10.0, so it too is rejected at SMTP time. Only
messages that score < 10 are accepted, and those that score >= 5.0 are
set aside in /var/mail/spam for review. That's 10-30 messages/day.
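To make the thresholds concrete, here is a rough Python sketch of that triage. It is only an illustration: the real logic lives in the Exim and SpamAssassin configuration on mail.python.org, not in Python, and the names `triage`, `REJECT_THRESHOLD` and `REVIEW_THRESHOLD` are made up for this example.

```python
# Illustrative only: hypothetical names, not the actual mail.python.org setup.
REJECT_THRESHOLD = 10.0  # SA score >= 10.0: rejected at SMTP time
REVIEW_THRESHOLD = 5.0   # 5.0 <= SA score < 10.0: set aside for manual review


def triage(sa_score):
    """Map a SpamAssassin score to 'reject', 'review', or 'deliver'."""
    if sa_score >= REJECT_THRESHOLD:
        return "reject"
    if sa_score >= REVIEW_THRESHOLD:
        return "review"  # this is the "suspected spam" pile
    return "deliver"
```

Only the "review" bucket costs a human any time, which is why its size is the number I care about.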
(I do occasionally scan Exim's reject log on mail.python.org to see
what's getting rejected today -- Exim kindly logs the full headers of
every message that is rejected after the DATA command. I usually make
it to about 11am of a given day's logfile before my eyes glaze over from
the endless stream of spam and viruses.)
Note that we *used* to accept messages before passing them to
SpamAssassin, so we never rejected anything on the basis of its SA score.
Back then, we saved and reviewed probably 50-70 messages/day. Very,
very, very few (if any) false positives scored >= 10.0, which is why
that's the threshold for SMTP-time rejection.
> OTOH, the false positive rate doesn't have anything to do with the number of
> spams received, it has to do with the number of non-spams received.
Err, yeah, good point. I make a point of talking about "suspected
spam", which is any message that scores between 5.0 and 10.0. IMHO, the
true nature of those messages can only be determined by manual
inspection.
> Maybe you don't want this kind of approach at all. The classifier doesn't
> have "gray areas" in practice: it tends to give probabilities near 1, or
> near 0, and there's very little in between -- a msg either has a
> preponderance of spam indicators, or a preponderance of non-spam indicators.
That's a great improvement over SpamAssassin then: with SA, the grey
area (IMHO) is scores from 3 to 10... which is why several python.org
lists now have a little bit of Mailman configuration magic that makes MM
set aside messages with an SA score >= 3 for list admin review. (It's
probably worth getting the list admin to do a bit more work in order to
avoid sending low-scoring spam to the list.)
However, as long as "very little" != "nothing", we still need to worry a
bit about that grey area. What do you think we should do with a message
whose spam probability is between (say) 0.1 and 0.9? Send it on, reject
it, or set it aside? Just how many messages fall in that grey area
anyways?
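One way to answer that last question would be to tally classifier output over a test corpus. The sketch below is a guess at how that might look, not anything from the GBayes code: `classify` and `tally` are hypothetical helpers, the 0.1 and 0.9 cutoffs are just the numbers I floated above, and I'm assuming a probability exactly on a cutoff counts as decided rather than grey.

```python
from collections import Counter


def classify(prob, ham_cutoff=0.1, spam_cutoff=0.9):
    """Three-way split of a spam probability (cutoffs are assumptions)."""
    if prob >= spam_cutoff:
        return "spam"
    if prob <= ham_cutoff:
        return "ham"
    return "unsure"  # the grey area: strictly between the two cutoffs


def tally(probs, ham_cutoff=0.1, spam_cutoff=0.9):
    """Count how many probabilities land in each of the three buckets."""
    return Counter(classify(p, ham_cutoff, spam_cutoff) for p in probs)
```

Running `tally` over the per-message probabilities from a test run and looking at the "unsure" count would say how much manual review the grey area implies.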
Greg
--
Greg Ward <gward@python.net> http://www.gerg.ca/
MTV -- get off the air!
-- Dead Kennedys