[Python-Dev] RE: The first trustworthy <wink> GBayes results

Wed, 28 Aug 2002 21:19:38 -0400

[Greg Ward]
> One of the other perennial-seeming topics on spamassassin-devel (a list
> that I follow only sporadically) is that careful manual cleaning of your
> corpus is *essential*.  The concern of the main SA developers is that
> spam in your non-spam folder (and vice-versa) will prejudice the genetic
> algorithm that evolves SA's scores in the wrong direction.  Gut instinct
> tells me the Bayesian approach ought to be more robust against this sort
> of thing, but even it must have a breaking point at which misclassified
> messages throw off the probabilities.

Like all other questions <wink>, this can be quantified if someone is
willing to do the grunt work of setting up, running, and analyzing
appropriate experiments.  This kind of algorithm is generally quite robust
against disaster, but note that even tiny changes in accuracy rates can have
a large effect on *you*:  say that 99% of the time the system says a thing
is spam, it really is.  Then say that degrades by a measly 1%:  99% falls to
98%.  From *your* POV this is huge, because the error rate has actually
doubled (from 1% wrong to 2% wrong:  you've got twice as many false
positives to deal with).
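
To make that arithmetic concrete, here is a minimal sketch. The function name
and the volume of 1000 flagged messages are made up for illustration; only the
99% and 98% figures come from the example above.

```python
def false_positives(flagged, precision):
    """Expected number of non-spam messages among `flagged` messages that
    the system called spam, given the fraction that really are spam."""
    return flagged * (1.0 - precision)


flagged = 1000  # hypothetical number of messages the system called spam
before = false_positives(flagged, 0.99)
after = false_positives(flagged, 0.98)

# Mistakes before, mistakes after, and the ratio between the two.
print(round(before), round(after), round(after / before, 2))
```

For 1000 flagged messages that is about 10 mistakes to clean up versus about
20, even though the headline accuracy moved by a single percentage point.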

So the scheme has an ongoing need for accurate human training (spam changes,
list topics change, list members change, etc.; the system needs an ongoing
random sample of both new spam and new non-spam to adapt).

> ...
> One possibility occurs to me: we could build our own corpus by
> collecting spam on python.org for a few weeks.

Simpler is better:  as you suggested later, capture everything for a while,
and without injecting Mailman or SpamAssassin headers.  That won't be a
particularly good corpus for the lists in general, because over any brief
period a small number of topics and posters dominate.  But it will be a fair
test for how systems do over exactly that brief period <wink>.

> Here's a rough breakdown of mail rejected by mail.python.org over the
> last 10 days, eyeball-estimated messages per day:
>
>   bad RCPT                       150 - 300 [1]
>   bad sender                      50 - 190 [2]
>   relay denied                    20 - 180 [3]
>   known spammer addr/domain       15 -  60
>   8-bit chars in subject         130 - 200
>   8-bit chars in header addrs     10 -  60
>   banned charset in subject        5 -  50 [4]
>   "ADV" in subject                 0 -   5
>   no Message-Id header           100 - 400 [5]
>   invalid header address syntax    5 -  50 [6]
>   no valid senders in header      10 -  15 [7]
>   rejected by SpamAssassin        20 -  50 [8]
>   quarantined by SpamAssassin      5 -  50 [8]

We should start another category, "Messages from Tim rejected for bogus
reasons" <wink>.

> [1] this includes mail accidentally sent to eg. giudo@python.org,
>     but based on scanning the reject logs, I'd say the vast majority
>     is spam.  However, such messages are rejected after RCPT TO,
>     so we never see the message itself.  Most of the bad recipient
>     addrs are either ancient (ipc6@python.org,
>     grail-feedback@python.org) or fictitious (success@python.org,
>     info@python.org).
>
> [2] sender verification failed, eg. someone tried to claim an
>     envelope sender like foo@bogus.domain.  Usually spam, but innocent
>     bystanders can be hit by DNS servers suddenly exploding (hello,
>     comcast.net).  This only includes hard failures (DNS "no such
>     domain"), not soft failures (DNS timeout).
>
> [3] I'd be leery of accepting mail that's trying to hijack
>     mail.python.org as an open relay, even though that would
>     be a goldmine of spam.  (OTOH, we could reject after the
>     DATA command, and save the message anyways.)
>
> [4] mail.python.org rejects any message with a properly MIME-encoded
>     subject using any of the following charsets:
>       big5, euc-kr, gb2312, ks_c_5601-1987
>
> [5] includes viruses as well as spam (and no doubt some innocent
>     false positives, although I have added exemptions for the MUA/MTA
>     combinations that most commonly result in legit mail reaching
>     mail.python.org without a Message-Id header, eg. KMail/qmail)
>
> [6] eg. "To: all my friends" or "From: <>"
>
> [7] no valid sender address in any header line -- eg. someone gives a
>     valid MAIL FROM address, but then puts "From: blah@bogus.domain"
>     in the headers.  Easily defeated with a "Sender" or "Reply-to"
>     header.
>
> [8] any message scoring >= 10.0 is rejected at SMTP time; any
>     message scoring >= 5.0 but < 10 is saved in /var/mail/spam
>     for later review

Greg, you show signs of enjoying this job too much <wink>.
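
For reference, the two score cutoffs in footnote [8] amount to a three-way
decision.  A minimal sketch, assuming nothing about how mail.python.org
actually wires this up; the function and constant names are hypothetical, and
only the 5.0 and 10.0 thresholds come from Greg's note.

```python
REJECT_AT = 10.0      # footnote [8]: score >= 10.0 is rejected at SMTP time
QUARANTINE_AT = 5.0   # footnote [8]: score >= 5.0 but < 10 is saved for review


def disposition(score):
    """Map a spam score to one of 'reject', 'quarantine', or 'accept'."""
    if score >= REJECT_AT:
        return "reject"
    if score >= QUARANTINE_AT:
        return "quarantine"
    return "accept"


for score in (12.3, 10.0, 7.5, 5.0, 1.2):
    print(score, disposition(score))
```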

> Executive summary:
>
>   * it's a good thing we do all those easy checks before involving
>     SA, or the load on the server would be a lot higher

So long as easy checks don't block legitimate email, I can't complain about
that.

>   * give me 10 days of spam-harvesting, and I can equal Bruce
>     Guenter's spam archive for 2002.  (Of course, it'll take a couple
>     of days to set the mail server up for the harvesting, and a couple
>     more days to clean through the ~2000 caught messages, but you get
>     the idea.)

If it would be helpful for me to do research on corpora that include the
headers, then the point would be to collect both spam and non-spam messages,
so that they can be compared directly to each other.  Those should be as
close to the bytes coming off the pipe as possible (e.g., before injecting
new headers of our own).  As is, I've had to throw the headers away in both
corpora, so am, in effect, working with a crippled version of the algorithm.

Or if someone else is doing research on how best to tokenize and tag
headers, I'm not terribly concerned about merging the approaches untested.
If the approach is valuable enough to deploy, we'll eventually see exactly
how well it works in real life.

> ...
> Perhaps that spam-harvesting run should also set aside a random
> selection of apparently-non-spam messages received at the same time.
> Then you'd have a corpus of mail sent to the same server, more-or-less
> to the same addresses, over the same period of time.

Yes, it wants something as close to a slice of real life as possible, in all
conceivable respects, including ratio of spam to not spam, arrival times,
and so on.
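
One way to get such a slice without keeping every message is to sample
uniformly from everything that arrives, spam and non-spam alike, which keeps
the spam ratio intact on average.  The following is only a sketch under that
assumption, using reservoir sampling (Algorithm R); `sample_stream` and the
fake message stream are hypothetical, and a real harvest would feed it the raw
bytes as received, before any headers are injected.

```python
import random


def sample_stream(messages, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable
    whose length is not known in advance (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, msg in enumerate(messages):
        if i < k:
            # Fill the reservoir with the first k messages.
            reservoir.append(msg)
        else:
            # Keep the new message with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = msg
    return reservoir


# Stand-in for raw messages coming off the pipe.
stream = [b"raw message %d" % n for n in range(1000)]
picked = sample_stream(stream, 50, seed=0)
print(len(picked))
```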

> Oh, any custom corpus should also include the ~300 false positives and
> ~600 false negatives gathered since SA started running on
> mail.python.org in April.

Definitely not.  That's not a slice of real life, it's a distortion based on
how some *other* system screwed up.  Train it systematically on that, and
you're not training it for real life.  The urge to be clever is strong, but
must be resisted <0.3 wink>.

What would be perfectly reasonable is to run (not train) the system against
those corpora to see how it does.
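
A minimal sketch of "run, not train": the classifier is only called, never
updated, and we just count disagreements with the known labels.  Everything
here is hypothetical, including `count_errors`, the toy classifier, and the
made-up sample messages; a real test would plug in the trained classifier and
the actual SpamAssassin error corpora.

```python
def count_errors(classify, messages, truth):
    """Count messages where classify(msg) disagrees with the known label.

    `classify` is any callable returning True for spam.  It is only called
    here, never trained, so the classifier's state is left untouched.
    """
    return sum(1 for msg in messages if classify(msg) != truth)


def toy_classify(msg):
    """Stand-in classifier for demonstration only."""
    return "viagra" in msg.lower()


# Made-up examples standing in for the two SpamAssassin error corpora.
sa_false_positives = ["Re: PEP 285 vote", "cheap viagra joke in my sig"]  # ham
sa_false_negatives = ["VIAGRA now", "hello friend, urgent wire transfer"]  # spam

fp = count_errors(toy_classify, sa_false_positives, truth=False)
fn = count_errors(toy_classify, sa_false_negatives, truth=True)
print("false positives:", fp, "false negatives:", fn)
```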

BTW, Barry said the good-message archives he put together were composed of
msgs archived after SpamAssassin was enabled.  Since about 80% of the 1%
"false positive" rate I first saw turned out to be blatant spam in the ham
corpus, this suggests SpamAssassin let about 160000 * 1% * 80% = 1280 spams
through to the python-list archive alone.  That doesn't jibe with "600 false
negatives" at all.  I don't want to argue about it, it's just fair warning
that I don't believe much that I hear <wink>.  In particular, in *this* case
I don't believe python-list actually got 160000 messages since April, unless
we're talking about April of 2000.
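
The back-of-the-envelope estimate above is simple enough to check
mechanically.  A minimal sketch; the variable names are made up, and the
figures are the ones quoted in this message.

```python
archived = 160000       # messages in the ham corpus, as reported
flagged_rate = 0.01     # the 1% "false positive" rate first seen
actually_spam = 0.80    # share of those that turned out to be blatant spam

implied_spam = archived * flagged_rate * actually_spam
claimed_false_negatives = 600  # the figure quoted for mail.python.org

# Implied spam count, and how many times larger it is than the claim.
print(round(implied_spam), round(implied_spam / claimed_false_negatives, 1))
```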