[Spambayes] Re: Training oddity/confusion

Mon Jan 10 20:37:47 CET 2005

On Mon, 10 Jan 2005 15:10:08 +1300, "Tony Meyer" <tameyer at ihug.co.nz> wrote:

>> I was experimenting with training last night (Outlook plugin 
>> v1.01) and noticed something odd: sometimes, when I trained a 
>> single message as spam and then ran a filtering pass over the 
>> whole spam collection, a few other spams ended up with a 
>> lower score than before.
>> 
>> Maybe I'm completely misunderstanding how the classifier 
>> works, but shouldn't train-as-spam increase spam token counts 
>> and thus make all messages containing those tokens appear 
>> more spammy, and have no effect on those that don't?
>
>Part of the calculation of a token's probability is:
>
>(spamcount and hamcount are for the token. nham and nspam are totals)
>
>        hamratio = hamcount / nham
>        spamratio = spamcount / nspam
>        prob = spamratio / (hamratio + spamratio)
>        [the bayesian adjustment follows, but isn't important here]

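Ah, of course. I was looking at the token counts and not thinking about the
total message counts. Training one more spam raises nspam, so spamratio (and
with it prob) drops a little for any spam token that isn't in the newly
trained message, and spams made up mostly of such tokens score lower.
Stupid weekend brain. :)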
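
To convince myself, here is a minimal sketch of just the quoted part of the
calculation, leaving out the bayesian adjustment. The function name and the
counts are made up for illustration; this is not the actual SpamBayes code,
and it assumes the token has been seen at least once so the division is safe.

```python
def token_spam_prob(spamcount, hamcount, nspam, nham):
    """Unadjusted spam probability for one token (no bayesian adjustment)."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    return spamratio / (hamratio + spamratio)


# Hypothetical token: seen in 30 of 100 trained spam and 5 of 100 trained ham.
before = token_spam_prob(spamcount=30, hamcount=5, nspam=100, nham=100)

# Train one more spam that does NOT contain this token: only nspam changes.
after = token_spam_prob(spamcount=30, hamcount=5, nspam=101, nham=100)

# The token's counts are untouched, yet its probability is slightly lower.
print(before, after)
```

The drop per token is tiny, but a message whose strongest clues are all
tokens like this one can plausibly end up with a lower overall score.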

I was doing a kind of manual "train to exhaustion", and the other thing I
noticed was that the spam needed a lot more training before classification
became accurate: so far I have trained on 82 of 644 ham, but on 409 of 1414
spam. I guess this simply means that my spam is a lot less consistent than
my ham.

BTW, I also found a trick in Outlook to be able to train on a given spam
more than once, to force correct classification. Normally this doesn't work
because the plugin treats a copy as the same message as the original, but
creating the copy in an IMAP folder seems to fool it.

-- Mat.