[spambayes-dev] imbalance within ham or spam training sets?

Skip Montanaro skip at pobox.com
Mon Nov 3 16:08:03 EST 2003


    Kenny> The important thing to note with respect to your original
    Kenny> concerns, though, is that this "rare" word calculation is
    Kenny> entirely independent of any other tokens in the training data.
    Kenny> The calculation involves the original straight probability, the
    Kenny> fixed factors of s and x, and the total number of occurrences of
    Kenny> that token in both ham and spam.  There is no fixed cutoff that
    Kenny> says a word is no longer rare, but neither does the definition of
    Kenny> rare depend on the relative numbers compared to any other token
    Kenny> in the training data.

Thanks, this is what I was getting at.
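
For anyone reading along in the archive, here is a minimal sketch of the
adjustment Kenny describes, assuming it takes the usual Robinson form
f = (s*x + n*p) / (s + n). The function name adjusted_prob is made up for
illustration and is not the classifier's actual API. The defaults s = 0.45
and x = 0.5 are assumed values for the two fixed factors. The inputs are
only the token's own straight probability p and its total count n, so no
other token in the training data affects the result.

```python
def adjusted_prob(p, n, s=0.45, x=0.5):
    """Blend a token's straight spam probability with a prior.

    p: straight probability for this token (0.0 to 1.0)
    n: total occurrences of this token in ham and spam combined
    s: strength given to the prior (assumed default)
    x: probability assumed for a never-seen token (assumed default)
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    # With n == 0 the result is exactly x. As n grows the result moves
    # smoothly toward p, so there is no fixed "rare" cutoff.
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # Same straight probability, increasing evidence for the token.
    for n in (0, 1, 10, 100):
        print(n, round(adjusted_prob(1.0, n), 3))
```

A token seen once with p = 1.0 is pulled well back toward x, while one seen
100 times is left almost at p. Nothing in the arguments refers to how often
any other token occurs.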

Skip


