[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Tue, 03 Sep 2002 13:53:36 -0400


[Tim again]
>> I must be missing something.  I would *hope* that you review
>> *all* messages claimed to be spam, in which case the number of msgs
>> to be reviewed would, in a perfectly accurate system, be equal to the
>> number of spams received.

[Greg Ward]
> Good lord, certainly not!  Remember that Exim rejects a couple hundred
> messages a day that never get near SpamAssassin -- that's mostly
> Chinese/Korean junk that's rejected on the basis of 8-bit chars or
> banned charsets in the headers.  Then, probably 50-75% of what SA gets
> its hands on scores >= 10.0, so it too is rejected at SMTP time.  Only
> messages that score < 10 are accepted, and those that score >= 5.0 are
> set aside in /var/mail/spam for review.  That's 10-30 messages/day.
>
> (I do occasionally scan Exim's reject log on mail.python.org to see
> what's getting rejected today -- Exim kindly logs the full headers of
> every message that is rejected after the DATA command.  I usually make
> it to about 11am of a given day's logfile before my eyes glaze over from
> the endless stream of spam and viruses.)
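
For concreteness, the triage Greg describes might be sketched like so;
sa_score() is a hypothetical stand-in for an actual SpamAssassin scoring
call, and the thresholds are the ones quoted above:

    # Rough sketch of the SMTP-time triage described above.  sa_score()
    # is a hypothetical stand-in for a real SpamAssassin scoring call;
    # the 10.0 and 5.0 thresholds are the ones Greg quotes.

    REJECT_THRESHOLD = 10.0   # refused at SMTP time
    REVIEW_THRESHOLD = 5.0    # set aside in /var/mail/spam for review

    def triage(message, sa_score):
        """Classify a message that already survived Exim's
        header-based rejection."""
        score = sa_score(message)
        if score >= REJECT_THRESHOLD:
            return 'reject'    # never accepted; refused after DATA
        elif score >= REVIEW_THRESHOLD:
            return 'review'    # lands in /var/mail/spam for eyeballing
        else:
            return 'deliver'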

I get about 200 spams per day on my own email accounts, and look at all of
them.  I don't look at the headers at all; I just look at the msgs in a
capable HTML-aware mail reader, as a matter of course while dealing with all
the day's email.  It's rare that it takes more than a second to recognize a
spam by eyeball and hit the delete key.  At about 200 per day, it's just now
reaching my "hmm, this is becoming a nuisance sometimes" threshold.  Our
tolerance levels for manual review seem to differ by a factor of 100 or more
<wink>.

> Note that we *used* to accept messages before passing them to
> SpamAssassin, so never rejected anything on the basis of its SA score.
> Back then, we saved and reviewed probably 50-70 messages/day.  Very,
> very, very few (if any) false positives scored >= 10.0, which is why
> that's the threshold for SMTP-time rejection.

I can tell you the mean false negative and false positive rates on what I've
been working on, and even measure their variance across both training and
prediction sets.  (The fn rate is well under 2% now (adding in more headers
should improve that a lot), and the fp rate under 0.05% (but I doubt that
adding in more headers will improve this)).  So long as we don't know the
rates for the scheme you're using now, there's no objective basis for
comparison.
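
The measurement itself is simple to sketch, assuming a hypothetical
classify() that returns True for spam; this is not the actual test
harness, just the shape of the computation:

    # Sketch of measuring fn/fp rates and their variance across several
    # train/predict splits.  classify() is a hypothetical stand-in for
    # the trained classifier (True means "spam").

    import statistics

    def error_rates(classify, hams, spams):
        """Return (fp_rate, fn_rate) over one prediction set."""
        fp = sum(1 for m in hams if classify(m))       # ham called spam
        fn = sum(1 for m in spams if not classify(m))  # spam called ham
        return fp / len(hams), fn / len(spams)

    def summarize(runs):
        """runs is a list of (fp_rate, fn_rate) pairs, one per split;
        needs at least two runs for a variance."""
        fps = [fp for fp, _ in runs]
        fns = [fn for _, fn in runs]
        return (statistics.mean(fps), statistics.variance(fps),
                statistics.mean(fns), statistics.variance(fns))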

...

>> Maybe you don't want this kind of approach at all.  The classifier
>> doesn't have "gray areas" in practice:  it tends to give
>> probabilities near 1, or near 0, and there's very little in between
>> -- a msg either has a preponderance of spam indicators, or a
>> preponderance of non-spam indicators.

> That's a great improvement over SpamAssassin then: with SA, the grey
> area (IMHO) is scores from 3 to 10... which is why several python.org
> lists now have a little bit of Mailman configuration magic that makes MM
> set aside messages with an SA score >= 3 for list admin review.  (It's
> probably worth getting the list admin to do a bit more work in order to
> avoid sending low-scoring spam to the list.)
>
> However, as long as "very little" != "nothing", we still need to worry a
> bit about that grey area.  What do you think we should do with a message
> whose spam probability is between (say) 0.1 and 0.9?  Send it on, reject
> it, or set it aside?

Under Graham's scheme, send it on.  It doesn't have grey areas in a useful
sense, because the scoring step only looks at a handful of extremes:
extremes in, extremes out, and when it's wrong it's *spectacularly* wrong
(e.g., the very rare (< 0.05%) false positives generally have
"probabilities" exceeding 0.99, and a false negative often has a
"probability" less than 0.01).
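
The extremeness is built into Graham's combining step:  only the tokens
whose probabilities lie farthest from 0.5 are consulted, and their
product swamps everything else.  A minimal sketch, following the formula
in Graham's "A Plan for Spam" (the per-token probability table is
assumed to have been built elsewhere):

    # Minimal sketch of Graham's combining step.  probs maps tokens to
    # estimated spam probabilities; the 15-token limit, the 0.4
    # unknown-word default, and the formula follow "A Plan for Spam".

    def spam_probability(tokens, probs, unknown=0.4, max_discriminators=15):
        # Keep only the most extreme tokens -- farthest from 0.5.
        ps = sorted((probs.get(t, unknown) for t in set(tokens)),
                    key=lambda p: abs(p - 0.5),
                    reverse=True)[:max_discriminators]
        if not ps:
            return 0.5          # no evidence at all, e.g. an empty body
        prod = inv = 1.0
        for p in ps:
            prod *= p
            inv *= 1.0 - p
        # Extremes in, extremes out:  mostly-high p's drive this toward
        # 1.0, mostly-low p's toward 0.0; 0.5 needs balanced evidence.
        return prod / (prod + inv)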

> Just how many messages fall in that grey area anyways?

I can't get at my testing setup now and don't know the answer offhand.  I'll
try to make time tonight to determine the answer.  I guess the interesting
stats are what percent of hams have probs in (0.1, 0.9), and what percent of
spams.  In general, it's only very brief messages that don't score near 0.0
or 1.0, so this *may* turn out to be the same thing as asking what
percentages of hams and spams are very brief.
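
Tallying that is trivial once per-message scores are in hand; score()
below is a hypothetical stand-in for the classifier's per-message spam
probability:

    # Hypothetical tally of the grey zone:  the fraction of messages
    # scoring strictly between 0.1 and 0.9.

    def grey_fraction(messages, score, lo=0.1, hi=0.9):
        grey = sum(1 for m in messages if lo < score(m) < hi)
        return grey / len(messages)

    # Run it separately on the ham and spam sets:
    #     grey_fraction(hams, score), grey_fraction(spams, score)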

Note too that adding the headers in *should* catch a lot more spam under
this scheme.  But, even as is, and even if I strip all the HTML tags out of
spam, fewer than 1 spam in 50 scores less than 0.9.  The ones that are
passed on now include all spams with empty bodies (a message with an empty
body scores 0.5).