[Spambayes] Training oddity/confusion

Mon Jan 10 03:10:08 CET 2005

> I was experimenting with training last night (Outlook plugin 
> v1.01) and noticed something odd: sometimes, when I trained a 
> single message as spam and then ran a filtering pass over the 
> whole spam collection, a few other spams ended up with a 
> lower score than before.
> 
> Maybe I'm completely misunderstanding how the classifier 
> works, but shouldn't train-as-spam increase spam token counts 
> and thus make all messages containing those tokens appear 
> more spammy, and have no effect on those that don't?

Part of the calculation of a token's probability is:

(spamcount and hamcount are the counts for the token; nham and nspam are
the total numbers of ham and spam messages trained)

        hamratio = hamcount / nham
        spamratio = spamcount / nspam
        prob = spamratio / (hamratio + spamratio)
        [the Bayesian adjustment follows, but isn't important here]

The key bit here is that if you train on more spam, then nspam increases.
For tokens that aren't in the newly trained message, spamcount stays the
same, so spamratio decreases.  Since hamratio stays constant, the token's
probability decreases, and so can the score of any message containing it.

Consider this example: you have two trained spam messages, both with the
tokens "free" and "money".  You then train on 100 more spam messages, all of
which have the token "free", but none of which have the token "money".
"money" should now be much less of a spam clue than it used to be, whereas
"free" should be just as strong, since it still appears in every spam message.
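
To put rough numbers on that example, here is a minimal sketch in Python 3.
It only covers the unadjusted ratio shown earlier, not the Bayesian
adjustment, and it is not the actual Spambayes code. The function name
token_prob and the ham counts are assumptions made up for illustration (the
example doesn't mention ham at all, and with no ham containing a token the
ratio would be 1.0 no matter what).

```python
def token_prob(spamcount, hamcount, nspam, nham):
    """Unadjusted spam probability for one token (hypothetical helper)."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    return spamratio / (hamratio + spamratio)

# Assumed ham corpus (not from the example above): 10 trained hams,
# one containing "free" and one containing "money".
NHAM = 10
HAM_FREE = 1
HAM_MONEY = 1

# Before: 2 trained spams, both containing "free" and "money".
before_free = token_prob(2, HAM_FREE, 2, NHAM)
before_money = token_prob(2, HAM_MONEY, 2, NHAM)

# After: 100 more spams trained, all with "free", none with "money".
after_free = token_prob(102, HAM_FREE, 102, NHAM)
after_money = token_prob(2, HAM_MONEY, 102, NHAM)

print("free:  %.3f -> %.3f" % (before_free, after_free))
print("money: %.3f -> %.3f" % (before_money, after_money))
```

With those assumed ham counts, "free" stays at about 0.909, while "money"
drops from about 0.909 to about 0.164, even though no message containing
"money" was retrained.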

Does this make sense?  Tim's much better at explaining the stats stuff than
me, but I believe all of that is true, at least :)

=Tony.Meyer