[spambayes-dev] Correcting training

Meyer, Tony T.A.Meyer at massey.ac.nz
Tue Sep 9 14:43:33 EDT 2003


> Heh.  This message of yours (the whole thing, including your
> commentary) was a false positive for me because of the Nigerian Scam
> content, and was indeed the worst false positive I've ever gotten:
> 
> Spam Score: 98% (0.98087)
> 
> '*H*'    0.038259
> '*S*'    1

:)  I wondered if it might cause trouble.  Still, 98% isn't that bad -
if you were automatically deleting only messages that score 100%, you'd
still be safe.

> That's apples and oranges.  Throwing bigrams into the mix at least 
> doubles the number of distinct features spambayes finds in a message, 
> and it found so many for you that you're suffering a form of the 
> dreaded old "cancellation disease":
[...]
> It's possible that max-clues
> cutoff should be raised when using a mix of unigrams and bigrams.  Or
> maybe you'd still suffer cancellation disease

That explains a lot.  I tried a very quick test with max_discriminators
at 300 and got:

Spam Score: 60% (0.600497)
'*H*'    0.799006
'*S*'    1

But a lot of clues were still missing (apart from the 0.4 to 0.6 ones).
So I upped it to 600, and got:

Spam Score: 93% (0.931465)
'*H*'    0.137069
'*S*'    1

Which is good enough for me.
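For anyone following along, max_discriminators caps how many clues get
fed into the chi-squared combining step.  A rough sketch of what that
step does (illustrative names, not the real classifier.py internals):

```python
import math

def chi2Q(x2, v):
    # Survival function of the chi-squared distribution for even
    # degrees of freedom v (closed form, as in spambayes' chi2 module).
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs, max_discriminators):
    # Keep only the strongest clues: those farthest from the neutral 0.5.
    clues = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    clues = clues[:max_discriminators]
    n = len(clues)
    # S is the spam indicator, H the ham indicator; each is 1 minus the
    # chi-squared tail probability of the summed log-evidence.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in clues), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in clues), 2 * n)
    return S, H, (S - H + 1.0) / 2.0
```

With a balanced pile of strong ham and strong spam clues, both S and H
saturate at 1 and the score lands near 0.5 - cancellation disease in a
nutshell; raising max_discriminators changes how many clues get the
chance to cancel.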

With the unigrams-only classifier, it gets:

Spam Score: 39% (0.390671)
'*H*'    1
'*S*'    0.781341

With the unigrams-only classifier and max_discriminators at 300, it
does really badly:

Spam Score: 0% (2.90264e-006)
'*H*'    1
'*S*'    5.79636e-006
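Incidentally, the reported score is just a simple fold of those two
synthetic clues - as far as I can tell it's S/2 + (1 - H)/2, and the
numbers above check out against that:

```python
def spamprob(S, H):
    # Final score: average the spam indicator S with the complement
    # of the ham indicator H (as in spambayes' chi-combined scheme).
    return (S + (1.0 - H)) / 2.0

print(spamprob(S=1.0, H=0.038259))   # ~0.98087, the original false positive
print(spamprob(S=1.0, H=0.799006))   # ~0.600497, max_discriminators=300
print(spamprob(S=1.0, H=0.137069))   # ~0.931465, max_discriminators=600
print(spamprob(S=0.781341, H=1.0))   # ~0.390671, unigrams only
```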

I'll play around with this next time I get a chance to do some testing.
I've left my active copy of spambayes using the uni/bigram mix to see
how it goes in 'real life'.  It's definitely noticeably slower - if it
were to be included, I think I'd have to take a good look at the code I
threw together and try to make it a lot more efficient.
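For the record, the mixing itself is trivial - something along these
lines, where the token spelling is my own rather than what the patched
tokenizer actually emits:

```python
def mixed_tokens(words):
    # Yield each unigram plus a bigram for every adjacent pair,
    # so an n-word message produces 2n - 1 tokens.
    prev = None
    for word in words:
        yield word
        if prev is not None:
            yield "bi:%s %s" % (prev, word)
        prev = word

print(list(mixed_tokens(["free", "money", "now"])))
# ['free', 'money', 'bi:free money', 'now', 'bi:money now']
```

That near-doubling of the token stream is where both the extra clues
and the extra runtime come from.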

=Tony Meyer
