[spambayes-dev] imbalance within ham or spam training sets?
Skip Montanaro
skip at pobox.com
Mon Nov 3 15:47:04 EST 2003
>> Let me rephrase the question again. There's a discussion in Gary
>> Robinson's LJ article
>>
>> http://www.linuxjournal.com/article.php?sid=6467
>>
>> about dealing with rare words which I didn't really follow.
alex> It's talking about the math behind unknown_word_strength and
alex> unknown_word_prob.
>> If I've trained on 1000 other ham messages and now encounter a
>> woodworking message, some of the words in there are likely to have
>> not been seen before ("lathe", for example). Such words obviously
>> can't contribute to scoring that message. Let's assume I then train
>> that message as ham. "lathe" now has a hamcount of 1 and a spamcount
>> of 0. It is a "rare word". How many more messages which contain
>> "lathe" do I have to train on before it is no longer "rare"?
alex> A word is not "rare" or "not rare" according to the
alex> classifier...
I understand that it's not a binary thing. I used that term because Gary
used it in his article.
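
For anyone else following along, here is a minimal sketch of the
adjustment as I read it from Gary's article, where the raw spam
probability p(w) is blended with a prior x using strength s:
f(w) = (s*x + n*p(w)) / (s + n), with n the number of trained messages
containing the word. The function name adjusted_prob is hypothetical,
the values s=0.45 and x=0.5 are assumed stand-ins for
unknown_word_strength and unknown_word_prob, and the 1000-message spam
count is made up for the example. It is not a copy of the classifier
code.

```python
def adjusted_prob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Robinson-style adjusted spam probability for one word.

    s and x are assumed stand-ins for unknown_word_strength and
    unknown_word_prob; nham and nspam are the total trained messages.
    """
    n = hamcount + spamcount
    if n == 0:
        # Never-seen word: only the prior is available.
        return x
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    # Raw estimate from counts alone (0.0 for a ham-only word).
    p = spamratio / (hamratio + spamratio)
    # Blend the prior (weight s) with the evidence (weight n).
    return (s * x + n * p) / (s + n)


# "lathe" seen in k ham messages and no spam, with roughly 1000
# messages trained on each side (the spam total is an assumption).
for k in range(6):
    print(k, round(adjusted_prob(k, 0, 1000 + k, 1000), 4))
```

Under those assumptions the score for "lathe" moves smoothly away from
x as k grows, so there is no count at which the word flips from "rare"
to "not rare".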
I seem to be having trouble making my ideas understood today... Was my
exposition that vague?
>> If there is a problem, it might be fairly easy to fall into a trap
>> which is a bit difficult to get out of.
alex> Lucky for us, there is no problem here. ;-)
That's all I was asking.
Skip