[Spambayes] Big spam messages file you care to share?

Greg Ward gward@python.net
Thu, 12 Sep 2002 14:32:25 -0400


On 12 September 2002, Robert Oschler said:
> If anyone has a big file of spam and non-spam (for false positive
> detection) messages that they care to share, which means they don't
> contain any confidential data, I'd like to know about it.  I get about
> 3 spam messages a week and I haven't been archiving them.

Currently gathering one on mail.python.org.  Should be a doozy: it
currently has (after about 30 hours of gathering)

  3944 bounce messages  (20529 kB)
  1269 regular messages ( 3373 kB)
       (includes spam that snuck past SpamAssassin -- not much so far)
       (also includes bounces with a non-empty envelope sender)
   457 junk messages    (24707 kB)
       (includes spam, viruses, and SpamAssassin false positives)
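
For a feel of the proportions, here is a quick sketch that turns the tallies above into shares and average sizes. The counts and kB figures are copied from the list; the names `folders`, `share`, and `average_kb` are made up for illustration and are not part of any existing tool.

```python
# Counts and sizes copied from the tallies above: (messages, kB).
folders = {
    "bounce": (3944, 20529),
    "regular": (1269, 3373),
    "junk": (457, 24707),
}

# Total number of messages gathered so far.
total = sum(count for count, _ in folders.values())


def share(name):
    """Fraction of all gathered messages that fall in this folder."""
    return folders[name][0] / total


def average_kb(name):
    """Average message size in kB for this folder."""
    count, kb = folders[name]
    return kb / count


for name in folders:
    print(f"{name:8s} {share(name):6.1%} of messages, "
          f"{average_kb(name):6.1f} kB average")
```

By these numbers bounces are roughly 70% of the messages, while the junk folder is small in count but large per message. That would fit with viruses carrying attachments, though that is only a guess from the sizes.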

The biggest artifact is the overwhelming preponderance of bounce
messages, but that's reality for mail.python.org.  (I bet "MAILER-DAEMON"
will be a popular ham word.  "qmail", "Exim", and "postfix" might also
be, since they pop up in bounce messages a lot too.)
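
A minimal sketch of how one might check that bet: count, for each token, how many ham messages contain it. The tokenizer here is a simple regex chosen for illustration, not the one Spambayes uses, and the three sample messages are invented.

```python
import re
from collections import Counter

# Illustrative tokenizer: a letter followed by letters, digits, or hyphens.
# This is an assumption, not the Spambayes tokenizer.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9-]+")


def document_frequency(messages):
    """Count, for each token, the number of messages it appears in."""
    counts = Counter()
    for text in messages:
        # set() so a token counts once per message, however often it repeats.
        counts.update(set(TOKEN_RE.findall(text)))
    return counts


# Invented sample "ham" folder: two bounces and one ordinary message.
ham = [
    "From: MAILER-DAEMON@example.org\nSubject: Undelivered mail (qmail)",
    "From: MAILER-DAEMON@example.org\nSubject: Mail delivery failed (Exim)",
    "From: someone@example.org\nSubject: Re: patch review",
]

df = document_frequency(ham)
for token in ("MAILER-DAEMON", "qmail", "Exim", "postfix"):
    print(token, df[token])
```

Running the same count over the spam folder and comparing the two would show which of these tokens really lean towards ham.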

I'll have to do something about distinguishing spam, viruses, and other
junk mail in the final corpus.  And of course, there will be the usual
exercise of removing ham from the spam folder and vice-versa.

But I think this will be a pretty good corpus for training spam
detectors for python.org traffic, once I've gathered about 10 days'
worth of traffic.

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
All the world's a stage and most of us are desperately unrehearsed.