[Python-Dev] The first trustworthy <wink> GBayes results

Tue, 3 Sep 2002 09:41:12 -0400

[Tim, last week]
> What's an acceptable false positive rate?

[my response]
> Speaking as one of the people who reviews suspected spam for python.org
> and rescues false positives, I would say that the more relevant figure
> is: how much suspected spam do I have to review every morning?  < 10
> messages would be peachy; right now it's around 5-20 messages per day.

[Tim again]
> I must be missing something.  I would *hope* that you review *all* messages
> claimed to be spam, in which case the number of msgs to be reviewed would,
> in a perfectly accurate system, be equal to the number of spams received.

Good lord, certainly not!  Remember that Exim rejects a couple hundred
messages a day that never get near SpamAssassin -- that's mostly
Chinese/Korean junk that's rejected on the basis of 8-bit chars or
banned charsets in the headers.  Then, probably 50-75% of what SA gets
its hands on scores >= 10.0, so it too is rejected at SMTP time.  Only
messages that score < 10 are accepted, and those that score >= 5.0 are
set aside in /var/mail/spam for review.  That's 10-30 messages/day.
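
To make the thresholds concrete, here is a minimal sketch of that
three-way decision.  It is illustration only: the function and constant
names are made up, and this is not the actual Exim or SpamAssassin
configuration on mail.python.org.

```python
# Hypothetical sketch of the score thresholds described above; the names
# here are invented for illustration, not taken from any real config.
REJECT_THRESHOLD = 10.0   # score >= 10.0: rejected at SMTP time
REVIEW_THRESHOLD = 5.0    # 5.0 <= score < 10.0: set aside for review


def triage(sa_score):
    """Return 'reject', 'review', or 'accept' for a SpamAssassin score."""
    if sa_score >= REJECT_THRESHOLD:
        return "reject"
    if sa_score >= REVIEW_THRESHOLD:
        return "review"
    return "accept"
```

Only the "review" bucket ever reaches a human, which is why the size of
that bucket is the number I care about.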

(I do occasionally scan Exim's reject log on mail.python.org to see
what's getting rejected today -- Exim kindly logs the full headers of
every message that is rejected after the DATA command.  I usually make
it to about 11am of a given day's logfile before my eyes glaze over from
the endless stream of spam and viruses.)

Note that we *used* to accept messages before passing them to
SpamAssassin, so we never rejected anything on the basis of its SA score.
Back then, we saved and reviewed probably 50-70 messages/day.  Very,
very, very few (if any) false positives scored >= 10.0, which is why
that's the threshold for SMTP-time rejection.

> OTOH, the false positive rate doesn't have anything to do with the number of
> spams received, it has to do with the number of non-spams received.

Err, yeah, good point.  I make a point of talking about "suspected
spam", which is any message that scores between 5.0 and 10.0.  IMHO, the
true nature of those messages can only be determined by manual
inspection.

> Maybe you don't want this kind of approach at all.  The classifier doesn't
> have "gray areas" in practice:  it tends to give probabilities near 1, or
> near 0, and there's very little in between -- a msg either has a
> preponderance of spam indicators, or a preponderance of non-spam indicators.

That's a great improvement over SpamAssassin then: with SA, the grey
area (IMHO) is scores from 3 to 10... which is why several python.org
lists now have a little bit of Mailman configuration magic that makes MM
set aside messages with an SA score >= 3 for list admin review.  (It's
probably worth getting the list admin to do a bit more work in order to
avoid sending low-scoring spam to the list.)

However, as long as "very little" != "nothing", we still need to worry a
bit about that grey area.  What do you think we should do with a message
whose spam probability is between (say) 0.1 and 0.9?  Send it on, reject
it, or set it aside?  Just how many messages fall in that grey area
anyways?
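
One way to answer the "how many" question would be to bucket the
classifier's output over a test corpus and count.  A rough sketch,
assuming the classifier yields one spam probability per message; the
0.1 and 0.9 cutoffs are just the example values above, and the function
names are hypothetical, not part of Tim's code:

```python
from collections import Counter


def bucket(prob, low=0.1, high=0.9):
    """Classify one spam probability as 'ham', 'grey', or 'spam'.

    Probabilities from low to high inclusive count as the grey area.
    """
    if prob > high:
        return "spam"
    if prob < low:
        return "ham"
    return "grey"


def count_buckets(probs, low=0.1, high=0.9):
    """Count how many probabilities land in each bucket."""
    return Counter(bucket(p, low, high) for p in probs)
```

Running count_buckets() over the probabilities from a test run would
show how big the "grey" bucket really is, and so how much manual review
it implies.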

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
MTV -- get off the air!
    -- Dead Kennedys