[Spambayes] very impressed....

Skip Montanaro skip at pobox.com
Tue Jun 17 22:40:48 EDT 2003

    Tim> [Don Andrews]
    >> I get about 3 or 4 a day which are unsure and upon reviewing them I
    >> can understand the rating of Unsure since they could go either way.

    Tim> OTOH, that's suspiciously good performance <wink>.  An Unsure rate
    Tim> of under 1% isn't unheard of, but has rarely been seen in personal
    Tim> email.

On June 4th I cvs up'd to get Tim's latest tokenizer improvements and
retrained on my full corpus.  Here's how Spambayes has classified all my
mail since then:

    ham             8125        59%
    spam            5561        40%
    unsure           163         1%
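
For anyone who wants to check the arithmetic, here is a minimal sketch of how
those percentages fall out of the raw counts. It assumes the three counts
above are the complete totals; the `percentages` helper is a name made up for
illustration, not part of Spambayes.

```python
# Counts copied from the table above.
counts = {"ham": 8125, "spam": 5561, "unsure": 163}


def percentages(counts):
    """Return each category's share of the total, rounded to a whole percent.

    Hypothetical helper for illustration only; not a Spambayes function.
    """
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


shares = percentages(counts)

# Print the counts and shares in the same layout as the table.
for name, n in counts.items():
    print(f"{name:<8}{n:>12}{shares[name]:>10}%")
```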

My spam load is actually somewhat higher than that, but I have other filters
which eliminate messages (both ham and spam) with previously seen message
ids and spams with duplicate "loose" checksums before Spambayes sees them.

My ham and spam cutoffs are 0.15 and 0.80, respectively.  My procmailrc file
segregates spams into two levels, >= 0.97 ("high spam") and < 0.97 ("low
spam").  This allows me to pay closer attention to messages classified as
spam which are likely to be mistakes.  I don't recall seeing any false
positives among the high spams, which make up the bulk of all spams, more
than 90% (5234 vs 327 low spams).  Furthermore, I tend to train on the
lowest of the "low spams", those which score less than 0.90.  This (I think)
should tend to push more spams into the high spam range.
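
The scheme described above can be sketched roughly as follows. This is only
an illustration of the thresholds, not the actual procmailrc recipes or
Spambayes code: the function names are invented, and the handling of scores
that land exactly on the ham or spam cutoff is an assumption.

```python
# Thresholds taken from the description above.
HAM_CUTOFF = 0.15        # scores below this count as ham
SPAM_CUTOFF = 0.80       # scores at or above this count as spam (assumed boundary)
HIGH_SPAM_CUTOFF = 0.97  # spams at or above this are "high spam"
TRAIN_BELOW = 0.90       # low spams scoring under this are training candidates


def bucket(score):
    """Map a spam score in [0, 1] to one of four buckets (hypothetical helper)."""
    if score < HAM_CUTOFF:
        return "ham"
    if score < SPAM_CUTOFF:
        return "unsure"
    if score >= HIGH_SPAM_CUTOFF:
        return "high spam"
    return "low spam"


def is_training_candidate(score):
    """True for the lowest of the low spams: spam, but scoring under 0.90."""
    return SPAM_CUTOFF <= score < TRAIN_BELOW


# Share of spams that landed in the high spam bucket (5234 high vs 327 low).
high_share = 5234 / (5234 + 327)

for s in (0.05, 0.50, 0.85, 0.93, 0.99):
    print(s, bucket(s), is_training_candidate(s))
```

Splitting at 0.97 keeps the pile that needs a careful look small, while the
0.90 training threshold targets the spams the classifier was least sure
about.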

I no longer train on everything.  Of the unsures I did train on, 40 turned
out to be hams and 68 spams.  Of the messages classified as spam or ham
there were two false positives and 13 false negatives.  I still have over 3000
high spams which I have yet to examine.  I'm fairly confident there aren't
any hams in that bunch.

