[Spambayes] Re: [SAdev] fully-public corpus of mail available

Daniel Quinlan quinlan@pathname.com
09 Oct 2002 21:47:02 -0700


> (Please feel free to forward this message to other possibly-interested
> parties.)

Some caveats (in decending order of concern):

1. These messages could end up being falsely (or incorrectly) reported
   to Razor, DCC, Pyzor, etc.  Certain RBLs too.  I don't think the
   results for these distributed tests can be trusted in any way,
   shape, or form when running over a public corpus.

2. These messages could also be submitted (more than once) to projects
   like SpamAssassin that rely on filtering results submission for GA
   tuning and development.

3. Spammers could adopt elements of the good messages to throw off
   filters.  And, of course, there's always progression in technology
   (by both spammers and non-spammers).

The second problem could be alleviated somewhat by adding a Nilsimsa
signature (or similar) to the mass-check file (the results format used
by SpamAssassin) and giving the message files unique names (MD5 or
SHA-1 of each file).

The third problem doesn't really worry me.

These problems (and perhaps others I have not identified) are unique
to spam filtering.  Compression corpuses and other performance-related
corpuses have their own set of problems, of course.

In other words, I don't think there's any replacement for having
multiple independent corpuses.  Finding better ways to distribute
testing and collate results seems like a more viable long-term solution
(and I'm glad we're working on exactly that for SpamAssassin).  If
you're going to seriously work on filter development, building a corpus
of 10000-50000 messages (half spam/half non-spam) is not really that
much work.  If you don't get enough spam, creating multi-technique
spamtraps (web, usenet, replying to spam) is pretty easy.  And who
doesn't get thousands of non-spam every week?  ;-)

Dan