[spambayes-dev] Wittel/Wu article on statistical attacks

Thu Sep 16 09:17:46 CEST 2004

> Has anyone investigated the attack methods outlined in the
> Wittel/Wu paper at the CEAS conference:
> 
>     http://ceas.cc/papers-2004/170.pdf
> 
> It's not obvious to me why SpamBayes should have performed as
> poorly as the authors indicated.  In particular, they were 
> adding common dictionary words which should have just added 
> non-extreme words which should have for the most part been 
> ignored (spamprobs between 0.4 and 0.6).
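
The expectation in the quoted text can be sketched roughly as below.  This
is only an illustration, not SpamBayes code: the function name, the 0.1
cutoff (chosen to match the 0.4 to 0.6 range mentioned above), the limit on
the number of clues, and all of the word probabilities are assumptions made
up for the example.

```python
def strong_clues(spamprobs, min_strength=0.1, max_clues=150):
    """Pick the tokens a classifier of this kind would actually score on.

    spamprobs maps token -> estimated spam probability.  min_strength and
    max_clues are assumed values for this sketch, not confirmed settings.
    """
    # Keep only tokens whose probability is far enough from the neutral 0.5.
    clues = [(abs(p - 0.5), word, p)
             for word, p in spamprobs.items()
             if abs(p - 0.5) >= min_strength]
    # Strongest clues first, then cap how many are used.
    clues.sort(reverse=True)
    return [(word, p) for _, word, p in clues[:max_clues]]


# Made-up probabilities for a spam message and for neutral padding words.
message = {"viagra": 0.99, "unsubscribe": 0.95, "click": 0.90}
padding = {"table": 0.50, "window": 0.45, "river": 0.58}

before = strong_clues(message)
after = strong_clues({**message, **padding})
```

With these numbers the padding words all fall inside the ignored band, so
the clue list is the same with and without them.  The attack only helps the
spammer if the added words carry hammy probabilities in the victim's
training data.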

I had a quick look at this when I first read the paper, but haven't had a
chance to look at it properly.

Concerns that I have:

 * It seems (though it's not clear) that they used train-on-everything,
which isn't great, particularly (I think) for this type of spam.

 * Mixed-corpus testing is not a good idea, and that appears to be what was
done here.

 * There are only From and Subject headers in the base test message.  That
loses a *lot* of header information.

 * The list of common English words is "slightly modified by removing spammy
words".  This means it's actually a list of words that they feel are hammy
or neutral.  It's hard to know how this affects the results.
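
To illustrate the last point, here is a rough sketch of how removing the
spammy entries shifts a word list towards ham.  The function names, the 0.6
cutoff, and every probability below are invented for the example; the paper
does not say how the authors chose which words to remove.

```python
def remove_spammy(words, spamprobs, spam_cutoff=0.6, unknown=0.5):
    """Drop words whose (assumed) spam probability exceeds spam_cutoff."""
    return [w for w in words if spamprobs.get(w, unknown) <= spam_cutoff]


def mean_prob(words, spamprobs, unknown=0.5):
    """Average spam probability of a word list; unseen words count as 0.5."""
    return sum(spamprobs.get(w, unknown) for w in words) / len(words)


# Hypothetical per-word spam probabilities from some trained database.
spamprobs = {"free": 0.92, "money": 0.88, "meeting": 0.15,
             "thanks": 0.20, "table": 0.50}
common = ["free", "money", "meeting", "thanks", "table"]

kept = remove_spammy(common, spamprobs)
shift = mean_prob(common, spamprobs) - mean_prob(kept, spamprobs)
```

Here the full list averages slightly on the spammy side of neutral, while
the filtered list averages well on the ham side, so padding with the
filtered list is no longer a neutral "common words" test.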

Attached is a script I wrote to try and duplicate the test.  I'm running
this at the moment, but it's taking a while (I didn't write it for speed!),
so I'll post results when I have them.  If they do match Wittel/Wu, then I
might have a look to see if different training methods have an effect or
not.

Suggestions for improvements in the script (or errors!) are welcome, of
course.  There are a few hard-coded locations, but it should be simple
enough to make sense of.

=Tony Meyer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wittelwu.py
Type: application/octet-stream
Size: 4823 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040916/1aae42d5/wittelwu.obj