[spambayes-dev] Wittel/Wu article on statistical attacks
Tony Meyer
tameyer at ihug.co.nz
Thu Sep 16 09:17:46 CEST 2004
> Has anyone investigated the attack methods outlined in the
> Wittel/Wu paper at the CEAS conference:
>
> http://ceas.cc/papers-2004/170.pdf
>
> It's not obvious to me why SpamBayes should have performed as
> poorly as the authors indicated. In particular, they were
> adding common dictionary words which should have just added
> non-extreme words which should have for the most part been
> ignored (spamprobs between 0.4 and 0.6).
I had a quick look at this when I first read the paper, but haven't had a
chance to look at it properly.
Concerns that I have:
* It seems (it's not clear) that they did train-on-everything, which isn't
great, particularly (I think) for this type of spam.
* Mixed-corpus testing is not a good idea, and it appears that that's
what's done here.
* There's only a from and subject header in the base test message. That's
losing a *lot* of header info.
* The list of common English words is "slightly modified by removing spammy
words". This means it's actually a list of words that they feel are hammy
or neutral. It's hard to know how this affects the results.
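To illustrate why the make-up of the word list matters, here is a minimal
sketch. It is not the attached script and not SpamBayes's actual scoring
code: the 0.1 cutoff around 0.5, the simple naive-Bayes-style combiner, and
all of the token probabilities are assumptions picked for illustration. It
shows that appended words with spamprobs between 0.4 and 0.6 are ignored,
while appended hammy words pull the score down.

```python
import math

# Assumption: tokens whose spamprob is within this distance of 0.5 are ignored.
MIN_STRENGTH = 0.1


def strong_clues(probs, min_strength=MIN_STRENGTH):
    """Keep only the token probabilities far enough from 0.5 to count."""
    return [p for p in probs if abs(p - 0.5) >= min_strength]


def combine(probs):
    """Combine the strong clues into one score in [0, 1].

    This is a plain naive-Bayes-style product computed in log space,
    used here only as a stand-in for a real combining scheme.
    """
    clues = strong_clues(probs)
    if not clues:
        return 0.5  # no evidence either way
    log_spam = sum(math.log(p) for p in clues)
    log_ham = sum(math.log(1.0 - p) for p in clues)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Hypothetical spamprobs for the tokens of a spam message.
spam_tokens = [0.99, 0.95, 0.9]
# Hypothetical spamprobs for appended words: neutral versus hammy.
neutral_words = [0.45, 0.5, 0.55] * 10
hammy_words = [0.1, 0.05] * 3

baseline = combine(spam_tokens)
with_neutral = combine(spam_tokens + neutral_words)
with_hammy = combine(spam_tokens + hammy_words)

print("baseline:     %.4f" % baseline)
print("with neutral: %.4f" % with_neutral)
print("with hammy:   %.4f" % with_hammy)
```

If the modified word list leans hammy rather than neutral for a given
corpus, the attack behaves like the third case rather than the second, which
is why it's hard to judge without knowing what was removed.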
Attached is a script I wrote to try and duplicate the test. I'm running
this at the moment, but it's taking a while (I didn't write it for speed!),
so I'll post results when I have them. If they do match Wittel/Wu, then I
might have a look to see if different training methods have an effect or
not.
Suggestions for improvements in the script (or errors!) are welcome, of
course. There are a few hard-coded locations, but it should be simple
enough to make sense of.
=Tony Meyer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: wittelwu.py
Type: application/octet-stream
Size: 4823 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040916/1aae42d5/wittelwu.obj