[spambayes-dev] Correcting training
Meyer, Tony
T.A.Meyer at massey.ac.nz
Tue Sep 9 14:43:33 EDT 2003
> Heh. This message of yours (the whole thing, including your
> commentary) was a false positive for me because of the Nigerian Scam
> content, and was indeed the worst false positive I've ever gotten:
>
> Spam Score: 98% (0.98087)
>
> '*H*' 0.038259
> '*S*' 1
:) I wondered if it might cause trouble. Still, 98% isn't that bad -
if you were automatically deleting all 100% messages, you'd still be
safe.
> That's apples and oranges. Throwing bigrams into the mix at least
> doubles the number of distinct features spambayes finds in a message,
> and it found so many for you that you're suffering a form of the
> dreaded old "cancellation disease":
[...]
> It's possible that max-clues
> cutoff should be raised when using a mix of unigrams and bigrams. Or
> maybe you'd still suffer cancellation disease
That explains a lot. I tried a very quick test with max_discriminators
at 300 and got:
Spam Score: 60% (0.600497)
'*H*' 0.799006
'*S*' 1
But a lot of clues were still being left out (apart from the 0.4 to 0.6
ones, which are always ignored). So I upped it to 600 and got:
Spam Score: 93% (0.931465)
'*H*' 0.137069
'*S*' 1
Which is good enough for me.
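For anyone following along, here's a rough sketch of what's going on - a
simplified stand-in for the chi-squared combining in classifier.py (the
real code works on word records rather than bare probabilities, and
guards against float underflow with frexp):

```python
from math import exp, log

def chi2Q(x2, v):
    # P(chisq >= x2) with v (even) degrees of freedom - the same
    # series form spambayes uses.
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs, max_discriminators=150):
    # Keep only the strongest clues (those furthest from 0.5), then
    # combine them chi-squared style.  S measures how unlikely the
    # spammy evidence is by chance, H the hammy evidence.
    clues = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    clues = clues[:max_discriminators]
    n = len(clues)
    if not n:
        return 0.5
    S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in clues), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in clues), 2 * n)
    return (S - H + 1.0) / 2.0

# "Cancellation disease": lots of strong spam clues *and* lots of
# strong ham clues drive both S and H to 1, leaving the final score
# stuck near 0.5 no matter how damning the individual clues are.
mixed = [0.99] * 80 + [0.01] * 80
print(chi_combine(mixed))
```

Raising max_discriminators doesn't cure cancellation by itself - it just
changes which clues make the cut - which fits the mixed results above.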
With the unigrams-only classifier, it gets:
Spam Score: 39% (0.390671)
'*H*' 1
'*S*' 0.781341
With the unigrams-only classifier and max_discriminators at 300, it does
really badly:
Spam Score: 0% (2.90264e-006)
'*H*' 1
'*S*' 5.79636e-006
I'll play around with this next time I get a chance to do some testing.
I've left my active copy of spambayes using the uni/bigram mix to see
how it goes in 'real life'. It's definitely noticeably slower - if it
were to be included, I think I'd have to take a good look at the code I
threw together and try to make it a lot more efficient.
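The basic idea of the mix is just to emit both the individual tokens and
adjacent pairs - roughly like this toy sketch (the real tokenizer does
far more than split on whitespace, and my hacked-together version isn't
this tidy):

```python
def mix_tokens(words):
    # Emit each unigram plus each adjacent-pair bigram.  Purely an
    # illustration of the uni/bigram mix; the actual spambayes
    # tokenizer generates tokens in much more elaborate ways.
    for w in words:
        yield w
    for a, b in zip(words, words[1:]):
        yield "bi:%s %s" % (a, b)

words = "urgent business proposal from nigeria".split()
tokens = list(mix_tokens(words))
print(len(tokens))  # 5 unigrams + 4 bigrams = 9 features
```

An n-word message now produces 2n-1 features instead of n, so there's
roughly twice the hashing, database lookups and scoring per message -
which is where the slowdown comes from.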
=Tony Meyer