[Spambayes] CRM114 in November breaks 99.9%. :-)
Bill Yerazunis
wsy@merl.com
Mon Dec 2 14:44:10 2002
Final test statistics for CRM114 for November are in:
Standard rules apply: no whitelists, no blacklists, realtime email stream
only (no "canned spam"), train only on errors, polynomial length 5.
For all of November (starting 9 AM Nov 1, ending 9 AM Dec 1):
Spams  Nonspams  False    False    Total   N+1 Accuracy  NHC's
                 Accepts  Rejects  Emails  (%)
1993   3914      4        0        5911    99.915        2
Spam features in hash tables: 398K
Nonspam features in hash tables: 299K
There was just 1 spam that got through in the last week of November:
a very strange spam written in mixed English and Czech, trying to sell
me diesel engine parts. It came through on a moto-head email list,
so I suppose it might be slightly topical, and it certainly was amusing
(rather reminiscent of the Monty Python "camshaft smuggling" skit),
but it's still spam and counts as such.
This gives an N+1 accuracy of > 99.9% for the entire month of November.
(99.932% for N-accuracy).
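For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes N-accuracy is simply (total - errors) / total, and that N+1 accuracy charges the filter with one extra hypothetical error; those definitions are not spelled out in this message, but they are consistent with the figures in the table. The function names are made up for illustration.

```python
def n_accuracy(total, errors):
    """Percent of messages classified correctly."""
    return 100.0 * (total - errors) / total


def n_plus_1_accuracy(total, errors):
    """Same as n_accuracy, but with one extra error charged.

    Assumed definition: it matches the 99.915 reported in the table.
    """
    return 100.0 * (total - (errors + 1)) / total


total_emails = 5911   # "Total Emails" column for November
errors = 4 + 0        # false accepts + false rejects

print(round(n_accuracy(total_emails, errors), 3))         # 99.932
print(round(n_plus_1_accuracy(total_emails, errors), 3))  # 99.915
```

With only 4 errors in 5911 messages, one more miss moves the result by about 0.017 percentage points, which is why the month only barely clears 99.9% on the N+1 measure.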
So, CRM114 barely squeaked through the month at >99.9%. Barely. There's
clearly still work to be done (the spambayes mailing list is kicking
around the proper way to evaluate probabilities; I'm looking into some
of their ideas as well).
--- On The Other Hand (the bad news) ---
December is looking much worse: TWO have gotten through already over
the weekend, one "barnyard teen" pornspam (it hasn't seen that before)
and one very short mortgage solicitation, written folksy-style.
I'm also getting mailer errors now out of Sendmail whenever I do
a "learn"; I'm starting to think that our systems people have
upgraded something and broken something else in the process. This
raises the question of whether the CRM114 training code is actually
getting run at all, or whether the increasing spam rate is
symptomatic of the evolution of spam against static filters.
-Bill Yerazunis