[spambayes-dev] imbalance within ham or spam training sets?
kennypitt at hotmail.com
Mon Nov 3 15:23:14 EST 2003
Skip Montanaro wrote:
> Let me rephrase the question again. There's a discussion in Gary
> Robinson's LJ article
> about dealing with rare words which I didn't really follow. If I've
> trained on 1000 other ham messages and now encounter a woodworking
> message, some of the words in there are likely to have not been seen
> before ("lathe", for example). Such words obviously can't contribute
> to scoring that message. Let's assume I then train that message as
> ham. "lathe" now has a hamcount of 1 and a spamcount of 0. It is a
> "rare word". How many more messages which contain "lathe" do I have
> to train on before it is no longer "rare". In particular, by training
> on 1000 other hams which don't contain that word, have I somehow
> created an artificial barrier to getting woodworking-specific words
> to have full effect as ham indicators?
OK, I see where you're coming from. I answered a related (albeit much
simpler <wink>) question for someone on the Spambayes list not long ago.
The "rare word" adjustment is a way of adjusting the contributed
probability for words that haven't been seen very often. In your
example of "lathe" with ham=1 and spam=0, the straight probability of
spam [spam / (spam + ham)] would be 0.0, but one occurrence doesn't make
it the most reliable indicator. SpamBayes adjusts this using the
"unknown_word_strength" (s in the Robinson article) and
"unknown_word_prob" (x in the article) options. You can see the
adjustment calculation in the probability() function in classifier.py.
The defaults for these options in Options.py are s=0.45 and x=0.5. Using
these defaults with the case of 1 ham and no spam, the actual
probability contributed to the chi2 combining is 0.155172. As the total
number of occurrences of the token increases, the contributed
probability gets closer and closer to the straight probability. So, for
ham=5 and spam=0, contributed probability is 0.041284; for ham=10 and
spam=0, contributed probability is 0.021531; and for ham=50 and spam=0,
contributed probability is 0.004460. As you can see, the probability
moves back toward the straight probability fairly quickly.
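For anyone who wants to reproduce those numbers, here is a minimal sketch
of the adjustment as described in Robinson's article, f = (s*x + n*p) / (s + n),
where n is the token's total count and p is its straight probability. The
function name adjusted_prob is made up for illustration, and the straight
probability uses the simplified spam / (spam + ham) form from this message;
the actual probability() in classifier.py may compute p differently (for
instance, by normalizing counts against the number of trained messages).

```python
def adjusted_prob(ham, spam, s=0.45, x=0.5):
    """Rare-word adjusted spam probability for a single token.

    ham, spam: how often this token was seen in ham and in spam.
    s: unknown_word_strength, x: unknown_word_prob (defaults quoted above).
    Hypothetical helper, not the SpamBayes API.
    """
    n = ham + spam
    if n == 0:
        return x  # never-seen token: only the prior x is available
    p = spam / n  # simplified straight probability (assumption, see lead-in)
    return (s * x + n * p) / (s + n)


for ham in (1, 5, 10, 50):
    print(f"ham={ham:2d} spam=0 -> {adjusted_prob(ham, 0):.6f}")
# ham= 1 spam=0 -> 0.155172
# ham= 5 spam=0 -> 0.041284
# ham=10 spam=0 -> 0.021531
# ham=50 spam=0 -> 0.004460
```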
The important thing to note with respect to your original concerns,
though, is that this "rare" word calculation is entirely independent of
any other tokens in the training data. The calculation involves the
original straight probability, the fixed factors of s and x, and the
total number of occurrences of that token in both ham and spam. There
is no fixed cutoff that says a word is no longer rare, but neither does
the definition of rare depend on the relative numbers compared to any
other token in the training data.
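Since there is no fixed cutoff, one rough way to answer "how many more
messages" is to pick a target value yourself and solve for the count. The
sketch below assumes a ham-only token (straight probability 0.0), so the
adjusted value is s*x / (s + n). The thresholds and the function name are
purely illustrative, not anything SpamBayes defines. Note that the only
inputs are s, x, and the token's own count, which is the independence
point made above.

```python
import math


def ham_only_count_for(threshold, s=0.45, x=0.5):
    """Smallest ham-only count n such that s*x / (s + n) <= threshold.

    Hypothetical helper: assumes the token has never appeared in spam.
    """
    return max(0, math.ceil(s * x / threshold - s))


# Illustrative thresholds only; SpamBayes has no such cutoff.
for threshold in (0.1, 0.05, 0.01):
    print(f"<= {threshold}: {ham_only_count_for(threshold)} ham occurrences")
```

With the default s and x this gives 2, 5, and 23 occurrences respectively,
regardless of how many other hams were trained first.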