[Spambayes] More on 'Spammer Attempts to Circumvent Bayesian Filter'

Kenny Pitt kennypitt at hotmail.com
Mon Jul 19 20:03:21 CEST 2004


Richard B Barger ABC APR wrote:
> As a writer, editor, avid reader, and participant in 15 discussion
> groups, I receive many narratives on many topics, and, except for
> isolated words and the nonsense characters the spammer has put at the
> end of my example, there is nothing in such text that sounds or looks
> particularly different from my normal message stream.    
> 
> In general, I'd think that such neutral text would tend to lower a
> message's spam probability, and the effect of one or a few suspect
> words would be insignificant.  Even if the word "baseball" rated at
> 300 percent likely to be spam <g>, the rest of the more "normal"
> words would, it seems to me, offset the spammy effect of "baseball"
> or other seldom-seen general words.     

Truly neutral text (where neutral is defined as having a spam prob between
0.4 and 0.6) is discarded by SpamBayes when calculating the final spam
score.  A word that has never been seen before will receive a spam prob of
0.5, so it will not be considered in the scoring.
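
As a rough sketch of that filtering step (an illustration only, not the
actual SpamBayes source): the names UNKNOWN_WORD_PROB, MIN_STRENGTH and
significant_clues are made up for this example, and the 0.1 cutoff is
simply taken from the 0.4 to 0.6 band described above.  How values lying
exactly on the band's edges are treated is an assumption.

```python
# Hypothetical constants, derived from the description in this message.
UNKNOWN_WORD_PROB = 0.5   # prob assigned to a never-seen word
MIN_STRENGTH = 0.1        # distance from 0.5 needed to count as a clue


def significant_clues(tokens, word_probs):
    """Return (token, prob) pairs that are far enough from neutral.

    word_probs maps each trained token to its spam probability.
    Tokens missing from the mapping are treated as unknown words.
    """
    clues = []
    for token in set(tokens):
        prob = word_probs.get(token, UNKNOWN_WORD_PROB)
        # Neutral words (roughly 0.4 to 0.6) are dropped entirely.
        if abs(prob - 0.5) >= MIN_STRENGTH:
            clues.append((token, prob))
    return sorted(clues)
```

With training data where "baseball" is 0.99, "meeting" is 0.05 and "today"
is 0.45, a message containing those words plus a never-seen word such as
"xyzzy" yields only "baseball" and "meeting" as clues; the other two are
ignored.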

This is what prevents random text from causing problems for most people.  It
is difficult for the spammer to come up with a list of words that you have
seen before and trained on as hammy.  For some people such as yourself who
communicate on an unusually wide variety of topics, it is probably more
likely for the spammer to stumble upon text that appears hammy based on your
training data.  This may also be exacerbated by the fact that you appear to
train on all messages, not just the ones that were misclassified
(although the long-term effects of different training methods have not been
established).

All non-neutral probabilities are combined using a statistical formula that
effectively gives more weight to more extreme probabilities (closer to 1 or
0).  In your "baseball" example, the single highly-spammy word would need
several somewhat-hammy words to balance it out.  On the other hand, if a
spammer manages to hit a single highly-hammy word while all the rest of his
random text scores as neutral and is discarded, it's possible for the
highly-hammy word to have a significant impact on the overall score.
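
To make the weighting concrete, here is a minimal sketch of Fisher-style
chi-squared combining.  The message above does not name the formula, so
treat the choice of technique as an assumption; the function names chi2q
and combine are made up for this example, and the sketch assumes every
probability lies strictly between 0 and 1.

```python
import math


def chi2q(x2, v):
    """Upper-tail probability of chi-squared x2 with v (even) degrees
    of freedom, computed from the closed-form series."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the sum slightly above 1.
    return min(total, 1.0)


def combine(probs):
    """Combine non-neutral word probabilities into one spam score."""
    n = len(probs)
    if n == 0:
        return 0.5  # no clues at all: stay neutral
    # Spam evidence: driven by how small the (1 - p) values are.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Ham evidence: driven by how small the p values are.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


one_hammy = combine([0.99, 0.3])
four_hammy = combine([0.99, 0.3, 0.3, 0.3, 0.3])
lone_ham_word = combine([0.01])
```

Under these assumptions, one_hammy comes out at about 0.81 and four_hammy
at about 0.60, so even four somewhat-hammy words do not fully offset one
very spammy word.  lone_ham_word is 0.01: a single strongly hammy clue,
with everything else discarded as neutral, decides the score on its own.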

-- 
Kenny Pitt


