[Spambayes] training problem?
tim.one at comcast.net
Tue Dec 2 15:53:16 EST 2003
> SpamBayes will use at most 150 tokens to determine the spam
> probability, while the complete message has 684. SpamBayes chooses
> the 150 strongest tokens (i.e. those with probabilities farthest from
> a neutral 0.5), and the rest are not used so are only shown in the
> Message Tokens section.
That's right. Note that this 150 is the default value of the Classifier's
max_discriminators option. Setting it much higher than that can cause
numerical problems in the inverse chi-squared probability computation,
specifically at the
# XXX If x2 is very large, exp(-m) will underflow to 0.
comment in chi2Q(). Testing showed that the exact value of
max_discriminators didn't matter much, provided it was at least 30 (or so).
Then again, most emails don't have 150 tokens, let alone 150 strong ones.
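
To make the two points above concrete, here is a minimal Python sketch, not
the actual SpamBayes source. The names strongest_tokens and chi2q_sketch are
hypothetical; chi2q_sketch is only modelled on the series that chi2Q() is
described as computing, and the inputs are made up for illustration. It shows
picking the tokens farthest from 0.5, and how exp(-m) underflowing to 0 wipes
out the whole sum once the number of discriminators gets large.

```python
import math

def strongest_tokens(probs, max_discriminators=150):
    """Return up to max_discriminators (token, prob) pairs, strongest first.

    "Strongest" means farthest from a neutral 0.5. Hypothetical helper,
    not the SpamBayes API.
    """
    ranked = sorted(probs.items(),
                    key=lambda kv: abs(kv[1] - 0.5),
                    reverse=True)
    return ranked[:max_discriminators]

def chi2q_sketch(x2, v):
    """Probability that a chi-squared variate with v (even) degrees of
    freedom exceeds x2, via the finite series exp(-m) * sum(m**i / i!).

    Illustrative only: every term is a multiple of exp(-m), so if exp(-m)
    underflows to 0.0 the result is 0.0 no matter what the true value is.
    """
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)   # underflows to 0.0 when m is very large
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

# 150 discriminators -> 300 degrees of freedom: exp(-150) is still representable.
ok = chi2q_sketch(300.0, 300)

# 1000 discriminators -> 2000 degrees of freedom: exp(-1000) underflows,
# so the series collapses to 0.0 even though the true value is near 0.5.
broken = chi2q_sketch(2000.0, 2000)
```

With these made-up inputs, ok comes out close to 0.5 while broken is exactly
0.0, which is the kind of numerical problem a very large max_discriminators
invites.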