[spambayes-dev] imbalance within ham or spam training sets?

Mon Nov 3 15:14:45 EST 2003

In message:  <16294.42960.302363.849243 at montanaro.dyndns.org>
             Skip Montanaro <skip at pobox.com> writes:
>
>Let me rephrase the question again.  There's a discussion in Gary Robinson's
>LJ article
>
>    http://www.linuxjournal.com/article.php?sid=6467
>
>about dealing with rare words which I didn't really follow.

It's talking about the math behind unknown_word_strength and
unknown_word_prob.

>If I've trained
>on 1000 other ham messages and now encounter a woodworking message, some of
>the words in there are likely to have not been seen before ("lathe", for
>example).  Such words obviously can't contribute to scoring that message.
>Let's assume I then train that message as ham.  "lathe" now has a hamcount
>of 1 and a spamcount of 0.  It is a "rare word".  How many more messages
>which contain "lathe" do I have to train on before it is no longer "rare".

A word is not "rare" or "not rare" according to the classifier... it's
not just a binary switch.  All words have their probabilities adjusted
towards unknown_word_prob by an amount determined by unknown_word_strength
and the number of trained messages in which the word has appeared.  The
more often the word has been seen (and trained), the smaller the adjustment.
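
As a rough sketch of that adjustment, here is the formula from Gary's
article in a few lines of Python.  The function name is made up for
illustration, and the 0.45 / 0.5 values for unknown_word_strength and
unknown_word_prob are assumed defaults, not necessarily what you have
configured:

```python
def adjusted_prob(raw_prob, n, strength=0.45, unknown_prob=0.5):
    """Pull raw_prob toward unknown_prob; the pull weakens as n grows.

    raw_prob: the word's unadjusted spam probability (0.0 for ham-only)
    n:        number of trained messages in which the word has appeared
    strength, unknown_prob: assumed values for unknown_word_strength
                            and unknown_word_prob
    """
    return (strength * unknown_prob + n * raw_prob) / (strength + n)


# A word seen only in ham has a raw probability of 0.0.  Watch the
# adjusted value slide toward 0.0 as the word is trained more often.
for n in (1, 2, 5, 20, 100):
    print(n, round(adjusted_prob(0.0, n), 4))
```

There is no threshold anywhere in that loop: each additional trained
message just moves the value a little further from unknown_word_prob.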

The only way this could be a binary switch would be if the unknown word
adjustments were strong enough to pull the probability for a word inside
the .4-.6 range (assuming default settings) that the classifier outright
ignores... but the default settings for unknown_word_* aren't that strong.
Assuming the defaults (unknown_word_strength 0.45, unknown_word_prob 0.5),
the hapax values (from only a single instance trained) work out to around
.16 and .84 for ham and spam respectively.
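
To check the "binary switch" scenario, here is a small sketch.  The helper
names are hypothetical, and the 0.45 / 0.5 / 0.1 values for
unknown_word_strength, unknown_word_prob and minimum_prob_strength are
assumed defaults:

```python
def adjusted_prob(raw_prob, n, strength=0.45, unknown_prob=0.5):
    """Adjust raw_prob toward unknown_prob, weighted by strength vs. n."""
    return (strength * unknown_prob + n * raw_prob) / (strength + n)


def is_ignored(prob, minimum_prob_strength=0.1):
    """True if prob falls inside the band around 0.5 that is skipped."""
    return abs(prob - 0.5) < minimum_prob_strength


# Hapaxes: one trained message, raw probability 0.0 (ham) or 1.0 (spam).
ham_hapax = adjusted_prob(0.0, 1)
spam_hapax = adjusted_prob(1.0, 1)
print(ham_hapax, is_ignored(ham_hapax))
print(spam_hapax, is_ignored(spam_hapax))

# With a much stronger (assumed, non-default) strength, the same hapax
# gets pulled into the ignored band and stops contributing.
strong_hapax = adjusted_prob(0.0, 1, strength=5.0)
print(strong_hapax, is_ignored(strong_hapax))
```

With unknown_word_prob at 0.5, a ham hapax only lands inside the .4-.6
band once the strength goes above 4, which is nowhere near the assumed
default.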

>In particular, by training on 1000 other hams which don't contain that word,
>have I somehow created an artificial barrier to getting woodworking-specific
>words to have full effect as ham indicators?

No.  Training on other mail which does not contain the word does not
affect the score for a ham-only word like "lathe" at all (unless you have
the experimental ham/spam imbalance adjustment enabled and it's actually
doing something... and you specifically engineered your question to make
the imbalance adjustment moot).
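
Here is a minimal sketch of your "lathe" scenario, modeled on how the
per-word probability is computed.  The function name and the 0.45 / 0.5
settings are assumptions for illustration, and the imbalance adjustment
is left out entirely:

```python
def word_prob(hamcount, spamcount, nham, nspam,
              strength=0.45, unknown_prob=0.5):
    """Spam probability for one word, without the imbalance adjustment.

    hamcount/spamcount: trained ham/spam messages containing the word
    nham/nspam:         total trained ham/spam messages
    """
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    raw = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    return (strength * unknown_prob + n * raw) / (strength + n)


# "lathe": trained once as ham, never as spam.
only_ham_trained = word_prob(1, 0, nham=1, nspam=1000)
after_1000_other_hams = word_prob(1, 0, nham=1001, nspam=1000)
print(only_ham_trained, after_1000_other_hams)
print(only_ham_trained == after_1000_other_hams)
```

This only covers the case in the question, a word with a spamcount of 0.
The raw probability is then 0 no matter how large nham gets, so only the
number of messages containing "lathe" moves its score.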

>If there is a problem, it might be fairly easy to fall into a trap which is
>a bit difficult to get out of.

Lucky for us, there is no problem here. ;-)

- Alex