[Spambayes] More on 'Spammer Attempts to Circumvent Bayesian Filter'

Mon Jul 19 21:28:51 CEST 2004

Thank you, Kenny, for taking my flight of fancy seriously.  That's another good
explanation.

Demonstrating how little I know, I want to take one more stab at this.

I postulate that a long, flowing narrative will have lots of neutral words, but,
over a large enough user base -- and that is the key -- will have more of what
the mass of users considers ham-tending words than spam-tending words (even
though, in the case of particular users, the opposite doubtless would be the
case).

I'm speculating that, over the course of a large quantity of spam and a large
quantity of ham, fewer tokens show up as spam in most users' evenly trained
databases than show up as ham.  Put another way, even allowing for differences
in user preference and experience, I'm guessing that the dictionary of spam
tokens is smaller than the dictionary of ham tokens.

Actually, that somewhat smaller "dictionary" probably works both ways, but I
still theorize that the larger number of likely ham words over the universe of
users will result in longer narratives being somewhat more likely to be judged
ham.
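
The guess about dictionary sizes could be checked against any evenly trained
corpus. What follows is only a minimal sketch in Python, not SpamBayes code:
the names token_counts and lean_sizes, the whitespace tokenizer, the simple
ratio-based spam probability, the 0.1 margin, and the toy messages are all
assumptions made for illustration. It counts how many distinct tokens lean
toward ham and how many lean toward spam.

```python
from collections import Counter

# Toy training data (hypothetical); a real test would use a user's own mail.
HAM = ["the meeting moved to friday", "thanks for the patch review"]
SPAM = ["buy cheap pills now", "buy now the best offer"]


def token_counts(messages):
    """Count, for each token, how many messages contain it at least once."""
    counts = Counter()
    for text in messages:
        # Assumed tokenizer: lowercase, split on whitespace.
        counts.update(set(text.lower().split()))
    return counts


def lean_sizes(ham, spam, margin=0.1):
    """Return (hammy, spammy): distinct tokens leaning each way.

    Assumes both lists are non-empty. A token's spam probability is
    estimated from the fraction of spam and ham messages containing it;
    tokens within `margin` of 0.5 are treated as neutral and not counted.
    """
    ham_counts = token_counts(ham)
    spam_counts = token_counts(spam)
    hammy = spammy = 0
    for token in set(ham_counts) | set(spam_counts):
        ham_ratio = ham_counts[token] / len(ham)
        spam_ratio = spam_counts[token] / len(spam)
        p_spam = spam_ratio / (ham_ratio + spam_ratio)
        if p_spam > 0.5 + margin:
            spammy += 1
        elif p_spam < 0.5 - margin:
            hammy += 1
    return hammy, spammy


print(lean_sizes(HAM, SPAM))
```

Four toy messages prove nothing, of course. The comparison would only mean
something when run over many users' databases, which is the "large enough
user base" the guess depends on.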

I completely agree about the random gibberish.

Thank you again.

Rich Barger

---

Kenny Pitt wrote:

> Richard B Barger ABC APR wrote:
> > One more thought:  It would intuitively seem that a longer, flowing
> > text narrative from a spammer would be slightly more likely to
> > include neutral and ham words than spam words.  I won't attempt to do
> > math on this, and there are probably lots of theoretical and
> > practical reasons why I'm wrong, but my gut tells me that, the longer
> > and more coherent the narrative, the more likely it would be to score
> > as ham.
>
> You are probably quite correct on this in general, but as usual it depends on
> your personal training data.  The random gibberish that spammers sometimes
> insert to fool hashing filters such as SpamNet has proven completely
> ineffective at fooling SpamBayes.  Random nonsense isn't going to appear hammy
> to anyone.  A narrative, on the other hand, will depend on how similar the
> topic is to something that you typically discuss in your e-mail. For me, text
> taken from a political news story would probably be far less likely to appear
> hammy than an excerpt from a computer mag or a sci-fi novel.  For others, it
> would probably be exactly the opposite.
>
> --
> Kenny Pitt