[spambayes-dev] Correcting training

Tim Peters tim.one at comcast.net
Mon Sep 8 13:09:56 EDT 2003

>> I really wonder what's going on here!  Since the very first
>> tests I ran last year, Nigerian scams have been absolutely
>> nailed for me.

[Tony Meyer]
> Actually, I've looked more closely at my current results, and I wasn't
> quite right.  Some are nailed for me - 100% (rounded), but others are
> solidly unsure (50%ish).
> An example of an unsure is below.

Heh.  This message of yours (the whole thing, including your commentary) was
a false positive for me because of the Nigerian Scam content, and was indeed
the worst false positive I've ever gotten:

Spam Score: 98% (0.98087)

'*H*'    0.038259
'*S*'    1
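
For readers wondering how the two indicators above turn into the reported
score: under chi-combining the final number is, as far as I know, just
(S - H + 1) / 2. The sketch below checks that arithmetic against the figures
shown; the function name final_score is made up for illustration and is not
a spambayes API.

```python
def final_score(H, S):
    """Combine the ham indicator H and the spam indicator S into one score.

    Assumption: this is the (S - H + 1) / 2 rule used with chi-combining.
    0.0 means "certainly ham", 1.0 "certainly spam", 0.5 "no idea".
    """
    return (S - H + 1.0) / 2.0


# Values taken from the clue report above.
score = final_score(H=0.038259, S=1.0)
print("%.5f" % score)
```

A message with H and S both near 1 (lots of strong evidence on both sides)
comes out near 0.5, which is what lands it in Unsure.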

> It looks like it just has too many clues of each type; although there are
> almost twice as many spam clues as ham, the ham ones are lower.  (The
> astute will notice that this is with the uni/bigrams classifier change
> discussed a couple of weeks ago, but I'm pretty sure I got these results
> without those changes, too).

That's apples and oranges.  Throwing bigrams into the mix at least doubles
the number of distinct features spambayes finds in a message, and it found
so many for you that you're suffering a form of the dreaded old
"cancellation disease":  spambayes found so many strong features in your
message that it artificially cut off the clues it looked at to the 150
strongest.  That's why your sorted-by-spamprob clue list leaps from the
quite hammy 0.09 'seem':

> 'seem'                              0.0909548          45     49

to the quite spammy 0.91 "mr. robert":

> 'mr. robert'                        0.908163            0      2

with nothing between them.  It's possible that the max-clues cutoff should be
raised when using a mix of unigrams and bigrams.  Or maybe you'd still
suffer cancellation disease (== a lot of strong ham clues and a lot of
strong spam clues; chi-combining at least rates msgs like that Unsure
instead of (in effect) flipping a coin).
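
To make the cutoff and the cancellation effect concrete, here is a rough
sketch, not the actual spambayes code.  It keeps only the clues farthest
from 0.5 (up to a cap of 150, the figure mentioned above) and then applies a
chi-squared combining step.  The names strongest_clues, chi2Q and
chi_combine, and the made-up clue probabilities, are assumptions for
illustration only.

```python
import math


def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def strongest_clues(probs, max_clues=150):
    """Keep at most max_clues spamprobs, preferring those farthest from 0.5."""
    ranked = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    return ranked[:max_clues]


def chi_combine(probs):
    """Combine clue spamprobs (each strictly between 0 and 1) into a score."""
    n = len(probs)
    if n == 0:
        return 0.5
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0


# Hypothetical message: many strong ham clues, many strong spam clues,
# and some mild clues that never make the cut.
clues = [0.02] * 75 + [0.98] * 75 + [0.3, 0.4, 0.6, 0.7] * 10
kept = strongest_clues(clues)
score = chi_combine(kept)
print(len(kept), min(kept), max(kept), score)
```

With these invented numbers every mild clue is discarded, so the kept list
jumps straight from very hammy to very spammy with nothing in between, and
the equally strong evidence on both sides pulls the score to the middle
(Unsure) rather than to either extreme.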

