[spambayes-dev] Correcting training
tim.one at comcast.net
Mon Sep 8 13:09:56 EDT 2003
>> I really wonder what's going on here! Since the very first
>> tests I ran last year, Nigerian scams have been absolutely
>> nailed for me.
> Actually, I've looked more closely at my current results, and I wasn't
> quite right. Some are nailed for me - 100% (rounded), but others are
> solidly unsure (50%ish).
> An example of an unsure is below.
Heh. This message of yours (the whole thing, including your commentary) was
a false positive for me because of the Nigerian Scam content, and was indeed
the worst false positive I've ever gotten:
Spam Score: 98% (0.98087)
> It looks like it just has too many clues of each type; although there are
> almost twice as many spam clues as ham, the ham ones are lower. (The
> astute will notice that this is with the uni/bigrams classifier change
> discussed a couple of weeks ago, but I'm pretty sure I got these results
> without those changes, too).
That's apples and oranges. Throwing bigrams into the mix at least doubles
the number of distinct features spambayes finds in a message, and it found
so many for you that you're suffering a form of the dreaded old
"cancellation disease": spambayes found so many strong features in your
message that it artificially cut off the clues it looked at to the 150
strongest. That's why your sorted-by-spamprob clue list leaps from the
quite hammy 0.09 'seem':
> 'seem' 0.0909548 45 49
to the quite spammy 0.91 "mr. robert":
> 'mr. robert' 0.908163 0 2
with nothing between them. It's possible that the max-clues cutoff should be
raised when using a mix of unigrams and bigrams. Or maybe you'd still
suffer cancellation disease (== a lot of strong ham clues and a lot of
strong spam clues; chi-combining at least rates msgs like that Unsure
instead of (in effect) flipping a coin).
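
For anyone who wants to poke at this, here is a rough sketch of the two
steps involved: keeping only the strongest clues, then chi-combining them.
It is a simplified illustration, not the actual spambayes code. The names
strongest_clues, chi2q and chi_combine are made up for this example, the
sketch skips the underflow handling a real classifier needs, and it assumes
every spamprob is strictly between 0 and 1.

```python
import math

MAX_CLUES = 150  # the cutoff discussed above


def strongest_clues(clues, max_clues=MAX_CLUES):
    """Keep the max_clues (word, spamprob) pairs farthest from 0.5."""
    ranked = sorted(clues, key=lambda wp: abs(wp[1] - 0.5), reverse=True)
    return ranked[:max_clues]


def chi2q(x2, v):
    """Return P(chi-squared >= x2) for an even number v of degrees of freedom."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Combine spamprobs into one score: near 0 is ham, near 1 is spam."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Assumes 0 < p < 1 for every p, so both logs are defined.
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    S = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)  # near 1 with strong spam evidence
    H = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)   # near 1 with strong ham evidence
    return (S - H + 1.0) / 2.0


# A made-up message with 75 strong ham clues, 75 strong spam clues,
# and 50 mild clues in between.
clues = ([("ham%d" % i, 0.02) for i in range(75)]
         + [("spam%d" % i, 0.98) for i in range(75)]
         + [("mild%d" % i, 0.40 + 0.004 * i) for i in range(50)])

kept = strongest_clues(clues)
score = chi_combine([p for _, p in kept])
```

With that made-up input, the 150 kept clues are exactly the strong ones, so
the sorted clue list jumps from hammy to spammy with nothing between them,
and the score comes out at about 0.5: S and H are both essentially 1, which
is what lands a msg in Unsure.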