... One thing I think would be worthwhile would be to run GBayes first, then only run stuff it thought was spam through SpamAssassin. Only messages that both systems categorized as spam would drop into the spam folder. This has a couple benefits over running one or the other in isolation:
* The training set for GBayes probably doesn't need to be as big
Training GBayes is cheap, and the more you feed it the less you need to do information-destroying transformations (like folding case or ignoring punctuation).
* The two systems use substantially different approaches to identifying spam,
Which could indeed be a killer-strong benefit.
so I suspect your false positive rate would go way down.
I'm already having a real problem with this just looking at content: the false-positive rate is already so low that I can't make statistically significant conclusions about things that may improve it. For example, if I do something that removes just *one* false positive in a test run on 4000 hams, the false-positive rate falls by 12.5% -- I don't have enough false positives to make fine-grained judgments. And, indeed, every time I test a change to the algorithm, the most *significant* thing I find is that it turns up another class of blatant spam hiding in the ham corpus: my training data is still too dirty, and cleaning it up is labor-intensive.
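To put rough numbers on that: a 12.5% fall from removing one false positive implies about 8 false positives in the 4000-ham run (that count is inferred from the percentage, not stated above). A minimal Python sketch, assuming each ham is an independent trial so the count is roughly binomial, shows why a change of one gets lost in the noise. The function names are made up for illustration.

```python
import math

def relative_drop(before: int, after: int) -> float:
    """Fractional fall in the false-positive count (and so in the rate)."""
    return (before - after) / before

def binomial_sd(n_hams: int, false_positives: int) -> float:
    """Standard deviation of the false-positive count under a binomial model."""
    p = false_positives / n_hams
    return math.sqrt(n_hams * p * (1 - p))

N_HAMS = 4000
FP_BEFORE = 8  # inferred from the text: 1/8 = 12.5%
FP_AFTER = 7   # one false positive removed

drop = relative_drop(FP_BEFORE, FP_AFTER)
sd = binomial_sd(N_HAMS, FP_BEFORE)
print(f"relative drop: {drop:.1%}")
print(f"sd of count:   {sd:.2f}")
```

Under that model the count wobbles by nearly 3 from one run to the next, so a change of 1 says very little on its own.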
False negatives would go up, but only testing can suggest by how much.
* Since SA is dog slow most of the time, SA users get a big speedup, since a substantially smaller fraction of your messages get run through it.
This sort of chaining is pretty trivial to set up with procmail. Dunno what the Windows set will do though.
There are different audiences here. Greg is keen to have a better approach for python.org as a whole, while Barry is keen about that and about doing something more generic for Mailman. Windows isn't an issue for either of those. Everyone else can eat cake <wink>.
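For concreteness, the chaining idea discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two classifiers are toy stand-ins, not the real GBayes or SpamAssassin interfaces, and all names here are hypothetical. The point is just that the slow classifier sees only what the cheap one flags, and a message lands in the spam folder only when both agree.

```python
from typing import Callable, Iterable, List, Tuple

Classifier = Callable[[str], bool]  # returns True for "looks like spam"

def chain_filter(
    messages: Iterable[str], cheap: Classifier, expensive: Classifier
) -> Tuple[List[str], List[str], int]:
    """Run `cheap` on everything and `expensive` only on what `cheap` flags.

    Returns (spam, ham, number of calls made to `expensive`).
    """
    spam: List[str] = []
    ham: List[str] = []
    expensive_calls = 0
    for msg in messages:
        if cheap(msg):
            expensive_calls += 1
            if expensive(msg):
                spam.append(msg)  # both classifiers said spam
                continue
        ham.append(msg)  # at least one classifier said ham
    return spam, ham, expensive_calls

# Toy stand-ins (hypothetical): real classifiers would go here.
def cheap_toy(msg: str) -> bool:
    text = msg.lower()
    return "viagra" in text or "free" in text

def expensive_toy(msg: str) -> bool:
    return "viagra" in msg.lower()

mail = ["Buy VIAGRA now", "free lunch at the sprint", "patch review"]
spam, ham, calls = chain_filter(mail, cheap_toy, expensive_toy)
print(spam, ham, calls)
```

In this toy run the second message is flagged by the cheap stand-in but rescued by the expensive one, and the third never reaches the expensive one at all, which is where the speedup would come from.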