[spambayes-dev] imbalance within ham or spam training sets?
Skip Montanaro
skip at pobox.com
Mon Nov 3 16:08:03 EST 2003
Kenny> The important thing to note with respect to your original
Kenny> concerns, though, is that this "rare" word calculation is
Kenny> entirely independent of any other tokens in the training data.
Kenny> The calculation involves the original straight probability, the
Kenny> fixed factors of s and x, and the total number of occurrences of
Kenny> that token in both ham and spam. There is no fixed cutoff that
Kenny> says a word is no longer rare, but neither does the definition of
Kenny> rare depend on the relative numbers compared to any other token
Kenny> in the training data.
Thanks, this is what I was getting at.
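
For reference, the adjustment Kenny describes can be sketched roughly as below. This is only an illustration in the Robinson style, not a copy of the classifier code: the function name adjusted_prob is made up for this example, and the defaults s=0.45 and x=0.5 are assumed values for the two fixed factors. The point is that the only inputs are the token's own straight probability p and its own total count n, so nothing about any other token enters the calculation.

```python
def adjusted_prob(p, n, s=0.45, x=0.5):
    """Blend a token's straight spam probability with a prior.

    p: straight probability for this token (from its own counts)
    n: total occurrences of this token in ham and spam combined
    s: strength given to the prior (assumed default, for illustration)
    x: probability assumed for a never-seen token (assumed default)
    """
    # Weighted average of the prior x (weight s) and the observed p (weight n).
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # The same straight probability is trusted more as the token's count grows.
    for n in (0, 1, 5, 100):
        print(n, adjusted_prob(1.0, n))
```

With n = 0 the result is just x; as n grows the result slides smoothly toward p, which is why there is no fixed cutoff at which a word stops being "rare".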
Skip