[Spambayes] training problem?

Tim Peters tim.one at comcast.net
Tue Dec 2 15:53:16 EST 2003

[Kenny Pitt]
> SpamBayes will use at most 150 tokens to determine the spam
> probability, while the complete message has 684.  SpamBayes chooses
> the 150 strongest tokens (i.e. those with probabilities farthest from
> a neutral 0.5), and the rest are not used so are only shown in the
> Message Tokens section.

That's right.  Note that this 150 is the default value of the Classifier's
max_discriminators option.  Setting it much higher than that can cause
numerical problems in the inverse chi-squared probability computation,
specifically at the

    # XXX If x2 is very large, exp(-m) will underflow to 0.

comment in chi2Q().  Testing showed that the exact value of
max_discriminators didn't matter much, provided it was at least 30 (or so).
Then again, most emails don't have 150 tokens, let alone 150 strong ones.
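
To make both points concrete, here is a rough Python sketch.  It is not the
SpamBayes source: strongest_tokens() and chi2q_sketch() are made-up names,
the token probabilities are invented, and the series below is just the
standard closed form for an even number of degrees of freedom, which is an
assumption about what chi2Q() computes.  The first function keeps the tokens
farthest from 0.5, as Kenny describes.  The second shows how exp(-m)
underflowing to 0 wipes out the whole sum once x2 gets large.

```python
import math

def strongest_tokens(token_probs, max_discriminators=150):
    # Rank tokens by distance from the neutral 0.5 and keep the top N.
    ranked = sorted(token_probs.items(),
                    key=lambda item: abs(item[1] - 0.5),
                    reverse=True)
    return ranked[:max_discriminators]

def chi2q_sketch(x2, v):
    # prob(chisq >= x2) with v degrees of freedom; v must be even.
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)   # underflows to 0.0 when m is large
    for i in range(1, v // 2):
        term *= m / i             # 0.0 * anything stays 0.0
        total += term
    return min(total, 1.0)

# Invented probabilities, for illustration only.
probs = {"viagra": 0.99, "python": 0.02, "the": 0.51, "meeting": 0.30}
top = strongest_tokens(probs, max_discriminators=2)

ok = chi2q_sketch(100.0, 60)       # exp(-50) is still representable
lost = chi2q_sketch(2000.0, 300)   # exp(-1000) underflows, result is 0.0
```

In the second call every term starts from 0.0, so the function returns
exactly 0.0 even though the true tail probability is tiny but nonzero.  More
discriminators means more degrees of freedom and potentially a larger x2,
which is how a very high max_discriminators can run into this.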

