[Spambayes] X-Spambayes-Exception in classifier.py
shuntim.luk at polyu.edu.hk
Wed Jul 13 08:41:33 CEST 2005
Meyer, Tony wrote:
>>>If so, then this should happen - if you train the message twice,
>>>then all the tokens for the message will be incremented twice and
>>>the total count should be incremented twice.
>>If so then how can one tell how many *distinct* messages are
>>actually trained? It may be a little confusing if people try
>>to use this information to follow the recommendation of
>>"number of ham and spam of equal order".
> SpamBayes doesn't care whether the ham or spam you train on are distinct
> or not. It's the total number of messages, not distinct messages, that
> counts. If you train on 500 copies of the same 2 ham and spam messages,
> then the math will work fine (but of course, it'll only be any good at
> recognising those two messages).
Now I begin to understand.
Your example would translate to the highly unlikely situation where the
user receives only the same ham and the same spam mail all the time. To
go further by paraphrasing your example, accidentally training on 500
copies of 1 spam and 500 (roughly distinct) ordinary ham mails will
result in highly unbalanced "knowledge" of the mails that one receives.
(Of course, in real situations, you do get repeated mails.) But in this
respect, it is still useful to know that you are not getting into this
kind of extreme situation when you train. On the other hand, you can
tweak/weight the filter by training on a given ham or spam message a
couple of times.
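To check my understanding of the counting, here is a minimal sketch. It is a toy model, not the actual SpamBayes classifier: the class ToyCounter, its method names, and the whitespace tokenizer are all made up for illustration, and the ratio it computes is only a simplified per-token spam ratio without any of the further adjustments a real filter applies.

```python
from collections import Counter


class ToyCounter:
    """Hypothetical toy model of per-token training counts (not SpamBayes code)."""

    def __init__(self):
        self.nham = 0                  # total ham messages trained, duplicates included
        self.nspam = 0                 # total spam messages trained, duplicates included
        self.ham_tokens = Counter()    # token -> number of ham trainings containing it
        self.spam_tokens = Counter()   # token -> number of spam trainings containing it

    def train(self, text, is_spam):
        # Each token is counted once per training call, so training the
        # same message twice increments its tokens and the total twice.
        tokens = set(text.lower().split())
        if is_spam:
            self.nspam += 1
            self.spam_tokens.update(tokens)
        else:
            self.nham += 1
            self.ham_tokens.update(tokens)

    def spam_ratio(self, token):
        # Simplified ratio of the token's spam rate to its combined rate;
        # returns 0.5 for a token that has never been seen.
        ham_rate = self.ham_tokens[token] / self.nham if self.nham else 0.0
        spam_rate = self.spam_tokens[token] / self.nspam if self.nspam else 0.0
        total = ham_rate + spam_rate
        return spam_rate / total if total else 0.5


toy = ToyCounter()
toy.train("cheap pills now", is_spam=True)
toy.train("cheap pills now", is_spam=True)    # same spam trained a second time
toy.train("meeting notes attached", is_spam=False)

print(toy.nspam, toy.nham, toy.spam_tokens["cheap"])
```

In this toy model the duplicate spam is indistinguishable from a second, distinct spam: the totals only record how many times training happened, which is why the number of distinct messages cannot be read back from them.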
Thanks very much again. I still have much to learn.