[Python-Dev] Re: The first trustworthy <wink> GBayes results

Wed, 28 Aug 2002 09:27:16 -0400

On 27 August 2002, Tim Peters said:
> Setting this up has been a bitch.  All early attempts floundered because it
> turned out there was *some* systematic difference between the ham and spam
> archives that made the job trivial.
> 
> The ham archive:  I selected 20,000 messages, and broke them into 5 sets of
> 4,000 each, at random, from a python-list archive Barry put together,
> containing msgs only after SpamAssassin was put into play on python.org.
> It's hoped that's pretty clean, but nobody checked all ~= 160,000+ msgs.  As
> will be seen below, it's not clean enough.

One of the other perennial-seeming topics on spamassassin-devel (a list
that I follow only sporadically) is that careful manual cleaning of your
corpus is *essential*.  The concern of the main SA developers is that
spam in your non-spam folder (and vice-versa) will prejudice the genetic
algorithm that evolves SA's scores in the wrong direction.  Gut instinct
tells me the Bayesian approach ought to be more robust against this sort
of thing, but even it must have a breaking point at which misclassified
messages throw off the probabilities.

But that's entirely consistent with your statement:

> Another lesson reinforces
> one from my previous life in speech recognition:  rigorous data collection,
> cleaning, tagging and maintenance is crucial when working with statistical
> approaches, and is damned expensive to do.
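
To put a rough number on the "breaking point" worry above, here is a
minimal sketch of how one token's spam probability drifts as spam leaks
into the ham folder.  It assumes a plain ratio-of-rates estimate, which
is a simplification and not necessarily what GBayes computes.  The
corpus sizes are the ones quoted in this thread (20,000 ham, 5 x 2,750
spam); the token frequencies and the function name are made up for
illustration.

```python
def token_spam_prob(f_spam, f_ham, n_spam, n_ham, leak):
    """Estimate P(spam | token) when a fraction `leak` (0 <= leak < 1)
    of the spam has been misfiled into the ham folder.

    f_spam / f_ham are the true fractions of spam / ham messages that
    contain the token (hypothetical values, not measured data).
    """
    misfiled = leak * n_spam
    ham_folder_size = n_ham + misfiled
    # The ham folder now mixes real ham with the misfiled spam.
    ham_rate = (n_ham * f_ham + misfiled * f_spam) / ham_folder_size
    # The spam folder still holds only spam, so its rate is unchanged.
    spam_rate = f_spam
    return spam_rate / (spam_rate + ham_rate)


for leak in (0.0, 0.01, 0.05, 0.20):
    p = token_spam_prob(f_spam=0.30, f_ham=0.001,
                        n_spam=13750, n_ham=20000, leak=leak)
    print(f"{leak:5.0%} of spam misfiled -> P(spam | token) ~ {p:.3f}")
```

Under these assumptions a strongly spammy token loses its edge steadily
as the leak grows, which is the effect careful corpus cleaning is meant
to prevent.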

On corpus collection...

> The spam archive:  This is essentially all of Bruce Guenter's 2002 spam
> collection, at <http://www.em.ca/~bruceg/spam/>.  It was broken at random
> into 5 sets of 2,750 spams each.

One possibility occurs to me: we could build our own corpus by
collecting spam on python.org for a few weeks.  Here's a rough breakdown
of mail rejected by mail.python.org over the last 10 days,
eyeball-estimated messages per day:

  bad RCPT                       150 - 300 [1]
  bad sender                      50 - 190 [2]
  relay denied                    20 - 180 [3]
  known spammer addr/domain       15 -  60
  8-bit chars in subject         130 - 200
  8-bit chars in header addrs     10 -  60
  banned charset in subject        5 -  50 [4]
  "ADV" in subject                 0 -   5
  no Message-Id header           100 - 400 [5]
  invalid header address syntax    5 -  50 [6]
  no valid senders in header      10 -  15 [7]
  rejected by SpamAssassin        20 -  50 [8]
  quarantined by SpamAssassin      5 -  50 [8]

[1] this includes mail accidentally sent to eg. giudo@python.org,
    but based on scanning the reject logs, I'd say the vast majority
    is spam.  However, such messages are rejected after RCPT TO,
    so we never see the message itself.  Most of the bad recipient
    addrs are either ancient (ipc6@python.org,
    grail-feedback@python.org) or fictitious (success@python.org,
    info@python.org).

[2] sender verification failed, eg. someone tried to claim an
    envelope sender like foo@bogus.domain.  Usually spam, but innocent
    bystanders can be hit by DNS servers suddenly exploding (hello,
    comcast.net).  This only includes hard failures (DNS "no such
    domain"), not soft failures (DNS timeout).    

[3] I'd be leery of accepting mail that's trying to hijack
    mail.python.org as an open relay, even though that would
    be a goldmine of spam.  (OTOH, we could reject after the
    DATA command, and save the message anyway.)

[4] mail.python.org rejects any message with a properly MIME-encoded
    subject using any of the following charsets:
      big5, euc-kr, gb2312, ks_c_5601-1987

[5] includes viruses as well as spam (and no doubt some innocent
    false positives, although I have added exemptions for the MUA/MTA
    combinations that most commonly result in legit mail reaching
    mail.python.org without a Message-Id header, eg. KMail/qmail)

[6] eg. "To: all my friends" or "From: <>"

[7] no valid sender address in any header line -- eg. someone gives a
    valid MAIL FROM address, but then puts "From: blah@bogus.domain"
    in the headers.  Easily defeated with a "Sender" or "Reply-To"
    header.

[8] any message scoring >= 10.0 is rejected at SMTP time; any
    message scoring >= 5.0 but < 10 is saved in /var/mail/spam
    for later review
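
Notes [4] and [8] describe simple rules, so here is a hedged sketch of
them in Python.  This is not the actual mail.python.org configuration;
the function and constant names are hypothetical, and only the charset
list and the two score thresholds come from the notes above.

```python
from email.header import decode_header

# Charsets from note [4]; thresholds from note [8].
BANNED_CHARSETS = {"big5", "euc-kr", "gb2312", "ks_c_5601-1987"}
REJECT_SCORE = 10.0
QUARANTINE_SCORE = 5.0


def subject_has_banned_charset(subject):
    """True if any MIME encoded-word in the raw Subject header value
    declares one of the banned charsets."""
    for _text, charset in decode_header(subject):
        # charset is None for parts that are not MIME-encoded.
        if charset and charset.lower() in BANNED_CHARSETS:
            return True
    return False


def spamassassin_action(score):
    """Map a SpamAssassin score to the action described in note [8]."""
    if score >= REJECT_SCORE:
        return "reject"        # refused at SMTP time
    if score >= QUARANTINE_SCORE:
        return "quarantine"    # saved for later review
    return "accept"
```

For example, subject_has_banned_charset("=?gb2312?B?xOO6ww==?=") is
true, while a plain ASCII subject has no declared charset and passes.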

Executive summary:

  * it's a good thing we do all those easy checks before involving
    SA, or the load on the server would be a lot higher

  * give me 10 days of spam-harvesting, and I can equal Bruce
    Guenter's spam archive for 2002.  (Of course, it'll take a couple
    of days to set the mail server up for the harvesting, and a couple
    more days to clean through the ~2000 caught messages, but you get
    the idea.)

> + Mailman added distinctive headers to every message in the ham
>   archive, which appear nowhere in the spam archive.  A Bayesian
>   classifier picks up on that immediately.
> 
> + Mailman also adds "[name-of-list]" to every Subject line.

Perhaps that spam-harvesting run should also set aside a random
selection of apparently-non-spam messages received at the same time.
Then you'd have a corpus of mail sent to the same server, more-or-less
to the same addresses, over the same period of time.
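
One way to set aside that random selection while mail is still arriving
is reservoir sampling (Algorithm R), which keeps a uniform sample of k
items without knowing in advance how many messages will come in.  This
is only a sketch: the function name is hypothetical, and `stream`
stands for whatever iterable of accepted messages the harvesting setup
would provide.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from `stream`,
    reading it once and holding at most k items in memory."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

Passing a seeded random.Random makes the selection reproducible, which
helps if the same ham subset has to be rebuilt later.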

Oh, any custom corpus should also include the ~300 false positives and
~600 false negatives gathered since SA started running on
mail.python.org in April.

        Greg