[spambayes-dev] Wittel/Wu article on statistical attacks

Kenny Pitt kennypitt at hotmail.com
Thu Sep 9 21:04:29 CEST 2004


Skip Montanaro wrote:
> Has anyone investigated the attack methods outlined in the Wittel/Wu
> paper at the CEAS conference: 
> 
>     http://ceas.cc/papers-2004/170.pdf
> 
> It's not obvious to me why SpamBayes should have performed as poorly
> as the authors indicated.  In particular, they were adding common
> dictionary words which should have just added non-extreme words which
> should have for the most part been ignored (spamprobs between 0.4 and
> 0.6).    

I'd like to try to write a script to run a similar test against my current
training data, although I'm not sure when I'll find the time. <0.5 wink>

A couple of thoughts, though.  As we all know, the accuracy of SpamBayes is
controlled entirely by the training data used.  It seems likely that 3000
ham messages from a public corpus would contain many more common words than
an individual user's typical mail stream would, especially if the user is
doing train-on-mistakes instead of train-on-everything.  I also wonder how
many of the 3000 spam messages they trained on were already using random
word insertion.  I would expect that once enough spam messages start
padding themselves with common words, those common words would quickly
become neutral or even spammy with continued training.

The "picospam" messages they use for testing have also been stripped of
almost all header information.  Since any spam must pass through the SMTP
mailer chain before it can be received by a user, I wonder how much
difference the Received header information would have made in the
classification.  This also raises the question of what parsing options
they used for SpamBayes; I suppose we can assume they left everything at
the defaults.  How much effect would some of the advanced parsing options,
particularly bigrams, have had on the results?
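My intuition for why bigrams might help here: an injected dictionary word
forms word *pairs* that were never seen in ham training, so those pairs
stay unknown and roughly neutral, while the genuine spammy phrases still
score.  A toy tokenizer sketch (hypothetical; the real SpamBayes bigram
option works on its own token stream, not a naive split):

```python
# Toy illustration of unigram + bigram feature generation.  A randomly
# inserted word ("aardvark") produces bigrams the classifier has never
# seen in ham, so insertion doesn't create hammy evidence.

def tokens_with_bigrams(text):
    """Yield each word, then each adjacent word pair as a 'bi:' token."""
    words = text.lower().split()
    yield from words
    for a, b in zip(words, words[1:]):
        yield f"bi:{a} {b}"

print(list(tokens_with_bigrams("buy cheap pills aardvark")))
# ['buy', 'cheap', 'pills', 'aardvark',
#  'bi:buy cheap', 'bi:cheap pills', 'bi:pills aardvark']
```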

-- 
Kenny Pitt
