[Python-Dev] The first trustworthy <wink> GBayes results
Fri, 30 Aug 2002 12:45:44 -0400
> One thing I think would be worthwhile would be to run GBayes first, then
> only run stuff it thought was spam through SpamAssassin. Only
> messages that both systems categorized as spam would drop into the spam
> folder. This has a couple benefits over running one or the other in
> * The training set for GBayes probably doesn't need to be as big
Training GBayes is cheap, and the more you feed it the less need to do
information-destroying transformations (like folding case or ignoring
> * The two systems use substantially different approaches to
> identifying spam,
Which could indeed be a killer-strong benefit.
> so I suspect your false positive rate would go way down.
I'm already having a real problem with this just looking at content: the
false positive rate is already so low that I can't make statistically
significant conclusions about things that may improve it (e.g., if I do
something that removes just *one* false positive in a test run on 4000 hams,
the false-positive rate falls by 12.5% -- I don't have enough false
positives to make fine-grained judgments. And, indeed, every time I test a
change to the algorithm, the most *significant* thing I find is that it
turns up another class of blatant spam hiding in the ham corpus: my
training data is still too dirty, and cleaning it up is labor-intensive).
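
To make the arithmetic above concrete, here is a rough sketch. The count of 8 false positives is an assumption: it is only inferred from the quoted 12.5% figure, since removing one of eight is a 12.5% relative drop. The Poisson-style noise estimate is likewise an illustrative assumption, not something measured in these test runs.

```python
import math

hams = 4000      # size of the ham test run, as stated above
fp_before = 8    # ASSUMPTION: implied by "falls by 12.5%" (1/8), not stated directly
fp_after = fp_before - 1   # the change removes exactly one false positive

rate_before = fp_before / hams
rate_after = fp_after / hams

# Relative improvement in the false-positive rate.
relative_drop = (rate_before - rate_after) / rate_before

# ASSUMPTION: treat the count as roughly Poisson, so its standard
# deviation is about sqrt(count). A change of 1 is well inside that noise.
poisson_sd = math.sqrt(fp_before)

print(f"rate before: {rate_before:.4%}, rate after: {rate_after:.4%}")
print(f"relative drop: {relative_drop:.1%}, count noise: +/- {poisson_sd:.1f}")
```

Under those assumptions a one-message change moves the rate by 12.5%, while run-to-run noise in the count is closer to 3 messages, which is why no fine-grained conclusion is possible.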
> False negatives would go up, but only testing can suggest by how
> * Since SA is dog slow most of the time, SA users get a big speedup,
> since a substantially smaller fraction of your messages get run
> through it.
> This sort of chaining is pretty trivial to set up with procmail.
> Dunno what the Windows set will do though.
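
The chaining idea quoted above can be sketched in a few lines of Python. This is only an illustration: `chained_is_spam`, `toy_cheap_filter`, and `toy_slow_filter` are hypothetical stand-ins, not the GBayes or SpamAssassin interfaces, and the keyword tests are placeholders for real classifiers.

```python
def chained_is_spam(message, cheap_filter, slow_filter):
    """Flag a message as spam only if both filters agree.

    The slow filter runs only on messages the cheap filter already flagged.
    """
    if not cheap_filter(message):
        return False          # cheap filter says ham: skip the slow filter
    return slow_filter(message)


# Hypothetical stand-in filters, for illustration only.
slow_calls = []   # records which messages reached the slow filter

def toy_cheap_filter(message):
    text = message.lower()
    return "viagra" in text or "free" in text

def toy_slow_filter(message):
    slow_calls.append(message)
    return "viagra" in message.lower()

messages = [
    "Python-Dev meeting notes",
    "FREE VIAGRA now",
    "Free software release announcement",
]
verdicts = [chained_is_spam(m, toy_cheap_filter, toy_slow_filter)
            for m in messages]
print(verdicts, len(slow_calls))
```

In this toy run the third message is flagged by the cheap filter but cleared by the slow one, so it stays out of the spam folder, and the slow filter is consulted for only two of the three messages.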
There are different audiences here. Greg is keen to have a better approach
for python.org as a whole, while Barry is keen about that and about doing
something more generic for Mailman. Windows isn't an issue for either of
those. Everyone else can eat cake <wink>.