[Spambayes] More on 'Spammer Attempts to Circumvent Bayesian Filter'

Mon Jul 19 20:48:57 CEST 2004

What a terrific explanation, Kenny!

As I mentioned to Tony, you folks think of everything!

And yes, you probably have analyzed my situation correctly, except for one
thing:

Because of my huge volume of email, I certainly do not train on everything.  I
only train on mistakes and some of the unsures.  (If I get 11 identical unsures
in a row, I certainly don't train on all of them.  And, from earlier help in
this discussion group, I learned that SpamBayes was classifying messages sent to
my several accounts differently; now that I know this, I selectively train,
based on the addressee, to "even things out.")

One more thought:  It would intuitively seem that a longer, flowing text
narrative from a spammer would be slightly more likely to include neutral and
ham words than spam words.  I won't attempt to do math on this, and there are
probably lots of theoretical and practical reasons why I'm wrong, but my gut
tells me that, the longer and more coherent the narrative, the more likely it
would be to score as ham.

Very helpful, Kenny.  Thank you.

Rich Barger
Kansas City

---

Kenny Pitt wrote:

> Richard B Barger ABC APR wrote:
> > As a writer, editor, avid reader, and participant in 15 discussion
> > groups, I receive many narratives on many topics, and, except for
> > isolated words and the nonsense characters the spammer has put at the
> > end of my example, there is nothing in such text that sounds or looks
> > particularly different from my normal message stream.
> >
> > In general, I'd think that such neutral text would tend to lower a
> > message's spam probability, and the effect of one or a few suspect
> > words would be insignificant.  Even if the word "baseball" rated at
> > 300 percent likely to be spam <g>, the rest of the more "normal"
> > words would, it seems to me, offset the spammy effect of "baseball"
> > or other seldom-seen general words.
>
> Truly neutral text (where neutral is defined as having a spam prob between 0.4
> and 0.6) is discarded by SpamBayes when calculating the final spam score.  A
> word that has never been seen before will receive a spam prob of 0.5, so it
> will not be considered in the scoring.
>
> This is what prevents random text from causing problems for most people.  It
> is difficult for the spammer to come up with a list of words that you have
> seen before and trained on as hammy.  For some people such as yourself who
> communicate on an unusually wide variety of topics, it is probably more likely
> for the spammer to stumble upon text that appears hammy based on your training
> data.  This may also be exaggerated by the fact that you appear to train on
> all messages, not just the ones that were not identified correctly (although
> long-term effects of different training methods have not been proven).
>
> All non-neutral probabilities are combined using a statistical formula that
> effectively gives more weight to more extreme probabilities (closer to 1 or
> 0).  In your "baseball" example, the single highly-spammy word would need
> several somewhat-hammy words to balance it out.  On the other hand, if a
> spammer manages to hit a single highly-hammy word while all the rest of his
> random text scores as neutral and is discarded, it's possible for the
> highly-hammy word to have a significant impact on the overall score.
>
> --
> Kenny Pitt
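
To make Kenny's two points concrete, here is a rough sketch in Python of how discarding neutral words and then combining the rest might look. It is not the SpamBayes source. The post only says "a statistical formula"; this sketch assumes chi-squared (Fisher-style) combining, and the names score, chi2q, and MIN_STRENGTH, along with every word probability in the usage note, are made up for illustration. It also leaves out any cap on how many words are considered.

```python
import math

# Assumption taken from the post: words whose spam prob lies between
# 0.4 and 0.6 count as neutral and are dropped before scoring.
MIN_STRENGTH = 0.1


def chi2q(x2, dof):
    """Probability that a chi-squared variable with an even number of
    degrees of freedom (dof) is at least x2."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def score(word_probs):
    """Combine per-word spam probs into one score from 0 (ham) to 1 (spam)."""
    # Step 1: discard neutral words, including never-seen words at 0.5.
    clues = [p for p in word_probs if abs(p - 0.5) >= MIN_STRENGTH]
    if not clues:
        return 0.5  # nothing left to judge by
    dof = 2 * len(clues)
    # Step 2: the log terms give extreme probs (near 0 or 1) far more pull.
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in clues), dof)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in clues), dof)
    return (spam_evidence - ham_evidence + 1.0) / 2.0
```

Under these assumptions, score([0.01] + [0.5] * 30) keeps only the one highly-hammy word, so the result is that word's own 0.01: the random neutral text contributes nothing. Going the other way, score([0.99, 0.3]) stays well above 0.5, and the score only drifts downward as more 0.3 words are added, which is the "several somewhat-hammy words to balance it out" effect.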