RE: [Python-Dev] The first trustworthy <wink> GBayes results

3 Sep 2002

      [Tim again]
...
...
I must be missing something.  I would *hope* that you review
*all* messages claimed to be spam, in which case the number of msgs
to be reviewed would, in a perfectly accurate system, be equal to the
number of spams received.
[Greg Ward]
...
Good lord, certainly not!  Remember that Exim rejects a couple hundred
messages a day that never get near SpamAssassin -- that's mostly
Chinese/Korean junk that's rejected on the basis of 8-bit chars or
banned charsets in the headers.  Then, probably 50-75% of what SA gets
its hands on scores >= 10.0, so it too is rejected at SMTP time.  Only
messages that score < 10 are accepted, and those that score >= 5.0 are
set aside in /var/mail/spam for review.  That's 10-30 messages/day.
(I do occasionally scan Exim's reject log on mail.python.org to see
what's getting rejected today -- Exim kindly logs the full headers of
every message that is rejected after the DATA command.  I usually make
it to about 11am of a given day's logfile before my eyes glaze over from
the endless stream of spam and viruses.)
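
For readers following along, the score-based routing described above can be sketched as below. This is only an illustration under stated assumptions: the function and constant names are hypothetical, and the real setup lives in Exim and SpamAssassin configuration rather than in Python.

```python
# Hypothetical sketch of the routing described above; the names are
# illustrative and not taken from the real Exim or SpamAssassin setup.
REJECT_AT = 10.0   # score >= 10.0: rejected at SMTP time
REVIEW_AT = 5.0    # 5.0 <= score < 10.0: set aside in /var/mail/spam

def route(sa_score):
    """Return what happens to a message with the given SA score."""
    if sa_score >= REJECT_AT:
        return "reject"
    if sa_score >= REVIEW_AT:
        return "review"
    return "deliver"

for score in (1.2, 5.0, 9.9, 10.0):
    print(score, route(score))
```

Only the middle band ("review") lands in front of a human, which is why the daily review load is so much smaller than the total spam volume.
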
I get about 200 spams per day on my own email accounts, and look at all of
them.  I don't look at the headers at all, I just look at the msgs in a
capable HTML-aware mail reader, as a matter of course while dealing with all
the day's email.  It's rare that it takes more than a second to recognize a
spam by eyeball and hit the delete key.  At about 200 per day, it's just now
reaching my "hmm, this is becoming a nuisance sometimes" threshold.  Our
tolerance levels for manual review seem to differ by a factor of 100 or more
<wink>.
...
Note that we *used* to accept messages before passing them to
SpamAssassin, so never rejected anything on the basis of its SA score.
Back then, we saved and reviewed probably 50-70 messages/day.  Very,
very, very few (if any) false positives scored >= 10.0, which is why
that's the threshold for SMTP-time rejection.
I can tell you the mean false negative and false positive rates on what I've
been working on, and even measure their variance across both training and
prediction sets.  (The fn rate is well under 2% now (adding in more headers
should improve that a lot), and the fp rate under 0.05% (but I doubt that
adding in more headers will improve this)).  So long as we don't know the
rates for the scheme you're using now, there's no objective basis for
comparison.
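
A minimal sketch of how such rates could be measured, assuming each test run yields a list of true labels and a list of classifier verdicts. The function name and the toy labels are made up for illustration; this is not the actual test harness.

```python
from statistics import mean, pvariance

def error_rates(is_spam, flagged):
    """Return (fp_rate, fn_rate) for one test run.

    is_spam -- list of true labels (True means spam)
    flagged -- list of classifier verdicts, in the same order
    Assumes the run contains at least one ham and one spam.
    """
    n_ham = sum(1 for s in is_spam if not s)
    n_spam = sum(1 for s in is_spam if s)
    # False positive: a ham flagged as spam.
    fp = sum(1 for s, f in zip(is_spam, flagged) if f and not s)
    # False negative: a spam that was not flagged.
    fn = sum(1 for s, f in zip(is_spam, flagged) if s and not f)
    return fp / n_ham, fn / n_spam

# Toy labels, made up purely to exercise the function.
runs = [
    ([False, False, True, True], [False, False, True, False]),
    ([False, False, True, True], [False, True, True, True]),
]
rates = [error_rates(truth, verdicts) for truth, verdicts in runs]
fp_rates = [fp for fp, fn in rates]
fn_rates = [fn for fp, fn in rates]
print("fp mean/var:", mean(fp_rates), pvariance(fp_rates))
print("fn mean/var:", mean(fn_rates), pvariance(fn_rates))
```

The same bookkeeping applied to a day's worth of hand-reviewed SpamAssassin verdicts would give the numbers needed for a comparison.
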

...
...
...
Maybe you don't want this kind of approach at all.  The classifier doesn't
have "gray areas" in practice:  it tends to give probabilities near 1, or
near 0, and there's very little in between -- a msg either has a
preponderance of spam indicators, or a preponderance of non-spam
indicators.
...
That's a great improvement over SpamAssassin then: with SA, the grey
area (IMHO) is scores from 3 to 10... which is why several python.org
lists now have a little bit of Mailman configuration magic that makes MM
set aside messages with an SA score >= 3 for list admin review.  (It's
probably worth getting the list admin to do a bit more work in order to
avoid sending low-scoring spam to the list.)
However, as long as "very little" != "nothing", we still need to worry a
bit about that grey area.  What do you think we should do with a message
whose spam probability is between (say) 0.1 and 0.9?  Send it on, reject
it, or set it aside?
Under Graham's scheme, send it on.  It doesn't have grey areas in a useful
sense, because the scoring step only looks at a handful of extremes:
extremes in, extremes out, and when it's wrong it's *spectacularly* wrong
(e.g., the very rare (< 0.05%) false positives generally have "probabilities"
exceeding 0.99, and a false negative often has a "probability" less than
0.01).
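
To see why the output is so polarized, here is a rough sketch of the combining step as Paul Graham's "A Plan for Spam" essay describes it: keep the tokens whose probabilities lie farthest from 0.5, then combine them. The cutoff of 15 tokens and the assumption that per-token probabilities are already clamped to [0.01, 0.99] come from that essay; the function name is hypothetical, and this is not necessarily what the code under test does.

```python
from math import prod

def graham_score(token_probs, max_tokens=15):
    """Combine per-token spam probabilities, Graham style.

    Assumes every probability is already clamped to [0.01, 0.99],
    so the denominator below can never be zero.
    """
    # Keep only the tokens farthest from the neutral value 0.5.
    extremes = sorted(token_probs, key=lambda p: abs(p - 0.5),
                      reverse=True)[:max_tokens]
    spam = prod(extremes)                   # evidence for spam
    ham = prod(1.0 - p for p in extremes)   # evidence for ham
    return spam / (spam + ham)

print(graham_score([0.99, 0.99, 0.99, 0.2, 0.6]))
print(graham_score([]))   # no tokens at all
```

A few strong indicators on one side swamp everything else, so the result is pushed toward 0 or 1. With no tokens both products are 1 and the score is exactly 0.5.
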
...
Just how many messages fall in that grey area anyways?
I can't get at my testing setup now and don't know the answer offhand.  I'll
try to make time tonight to determine the answer.  I guess the interesting
stats are what percent of hams have probs in (0.1, 0.9), and what percent of
spams.  In general, it's only very brief messages that don't score near 0.0
or 1.0, so this *may* turn out to be the same thing as asking what
percentages of hams and spams are very brief.
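
The statistic in question is easy to compute once per-message scores are saved. A small sketch, assuming two lists of scores; the helper name and the sample scores are made up for illustration and are not measured results.

```python
def percent_in_gray_area(scores, low=0.1, high=0.9):
    """Return the percentage of scores strictly between low and high."""
    if not scores:
        return 0.0
    middle = sum(1 for s in scores if low < s < high)
    return 100.0 * middle / len(scores)

# Made-up scores, only to show the call.
ham_scores = [0.0, 0.01, 0.5, 0.02]
spam_scores = [1.0, 0.99, 0.95, 0.3, 1.0]
print("hams in (0.1, 0.9):", percent_in_gray_area(ham_scores), "%")
print("spams in (0.1, 0.9):", percent_in_gray_area(spam_scores), "%")
```
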

Note too that adding the headers in *should* catch a lot more spam under
this scheme.  But, even as is, and even if I strip all the HTML tags out of
spam, fewer than 1 spam in 50 scores less than 0.9.  The ones that are
passed on now include all spams with empty bodies (a message with an empty
body scores 0.5).

Tim Peters