[Spambayes] training problem?

Tue Dec 2 15:04:36 EST 2003

Seth Goodman wrote:
> Attached are two similar spams that I trained on.  Since they are
> regular-looking newsletters that I can't succeed in opting out of,
> I'm not surprised that they look hammy to the classifier.  However,
> many of the tokens that the tokenizer found were *not* listed in the
> spam score section for either message, despite the fact that these
> tokens appear in both trained spams.  Conspicuously absent are the
> tokens 'subject:ADV', 'email addr:wsntv7511.com' and 'email
> name:info'.
[snip]
>
> Message Tokens:
> 
> 684 unique tokens

SpamBayes uses at most 150 tokens to compute the spam probability, but
the complete message has 684.  It picks the 150 strongest tokens (i.e.
those with probabilities farthest from a neutral 0.5); the rest play no
part in the score, so they show up only in the Message Tokens section.
SpamBayes also ignores any token whose probability lies in the range
0.4 to 0.6, i.e. is not <0.4 or >0.6.
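
For illustration, here is a rough Python sketch of that selection rule.
It is not the actual SpamBayes code: the function name select_clues, its
parameters, and the token probabilities are all made up for the example.
Only the 150-token limit and the 0.4/0.6 cutoffs come from the
description above.

```python
# Hypothetical sketch of the selection rule, not SpamBayes' real code.
MAX_TOKENS = 150   # at most this many tokens contribute to the score
LOW, HIGH = 0.4, 0.6   # tokens with LOW <= prob <= HIGH are ignored


def select_clues(token_probs, max_tokens=MAX_TOKENS, low=LOW, high=HIGH):
    """Return (token, prob) pairs that would be scored, strongest first."""
    # Drop near-neutral tokens first.
    strong = [(tok, p) for tok, p in token_probs.items()
              if p < low or p > high]
    # Strength is the distance from a neutral 0.5.
    strong.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    # Keep only the strongest max_tokens.
    return strong[:max_tokens]


# Made-up probabilities, purely for illustration.
probs = {
    "viagra": 0.99,
    "meeting": 0.02,
    "click": 0.61,
    "newsletter": 0.55,   # too close to 0.5, so ignored
    "the": 0.5,           # neutral, so ignored
}

for token, prob in select_clues(probs):
    print(token, prob)
```

A token missing from the spam score section was therefore either too
close to 0.5 or crowded out by 150 stronger tokens.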

-- 
Kenny Pitt