[Spambayes] Result of a test

Tim Peters tim.one@comcast.net
Fri, 04 Oct 2002 01:11:58 -0400


>> prob('powernews,') = 0.77651
>> prob('powernews.') = 0.77651

BTW, under Gary's probability adjustment (provided you stick to the default
"unknown word prob" of 0.5), it's impossible for a spamprob to end up on the
other side of 0.5 from the probability-by-counting estimate.  (This wasn't
true when we were using Paul's prob calculations:  there it was possible for
a word to be a ham indicator even if it appeared more often in spam(!).)
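
To see why the adjustment can't cross 0.5, here's a minimal sketch, assuming
the adjustment has the usual form f = (s*x + n*p) / (s + n), where p is the
counting estimate, n is the number of training messages containing the word,
x is the "unknown word prob", and s is the strength given to x.  The function
name and the s=1.0 value are made up for illustration; they aren't taken from
the project's code.

```python
def robinson_adjust(p, n, s=1.0, x=0.5):
    """Blend the counting estimate p with the prior x.

    n is how many training messages contained the word; s says how many
    messages' worth of evidence the prior x counts for.  (Hypothetical
    helper; s=1.0 is only an illustrative value.)
    """
    return (s * x + n * p) / (s + n)


# f is a weighted average of x and p, so with x = 0.5 it always lies
# between 0.5 and p: it can be pulled toward 0.5 but never past it.
for p in (0.1, 0.5, 0.9):
    for n in (0, 1, 10, 1000):
        f = robinson_adjust(p, n)
        assert min(p, 0.5) - 1e-12 <= f <= max(p, 0.5) + 1e-12
```

So an adjusted spamprob above 0.5 means the counting estimate was above 0.5
too, whatever n was.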

So that tells me that these variants of "powernews" *did* appear more often
in spam than in ham in the training data.  But that's a very unlikely word
to find in spam, and it shows up routinely in all the "APC PowerNews" false
positives papadoc
reported.  This very strongly suggests that the spam in that collection is
polluted with ham, and specifically that some APC PowerNews newsletters were
incorrectly classified as spam in the training data.  This would go a long
way toward explaining why the "APC PowerNews" false positives got such
extremely high scores (if the system was fed some and *told* they were spam,
it believes you <wink>).
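
A rough sketch of how little pollution that takes, assuming the counting
estimate is spam_ratio / (spam_ratio + ham_ratio) with each ratio being the
fraction of that kind of message containing the word.  The function name and
all the counts below are invented for illustration; they are not papadoc's
data.

```python
def counting_prob(spam_hits, nspam, ham_hits, nham):
    """Probability-by-counting estimate for one word (hypothetical helper)."""
    spam_ratio = spam_hits / nspam
    ham_ratio = ham_hits / nham
    if spam_ratio + ham_ratio == 0:
        raise ValueError("word never seen in training")
    return spam_ratio / (spam_ratio + ham_ratio)


# Made-up corpus: 1000 ham, 1000 spam, and 4 newsletters containing the word.
# Correctly labeled, all 4 sit in the ham, and the word is pure ham evidence.
clean = counting_prob(spam_hits=0, nspam=1000, ham_hits=4, nham=1000)

# Now 3 of those 4 newsletters are misfiled as spam in the training data.
polluted = counting_prob(spam_hits=3, nspam=1003, ham_hits=1, nham=997)

assert clean == 0.0
assert polluted > 0.5  # the word now counts as spam evidence
```

With a word this rare, a handful of misfiled newsletters is enough to flip it
from a ham clue to a spam clue.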