[Spambayes] Upgrading from 1.0a2

Richie Hindle richie at entrian.com
Fri Dec 5 09:23:20 EST 2003


> I see Paul Graham quoting hit rates of 99.7% or
> whatever and I lust ;)

I don't know what my percentage hit rate is, but it's well over 99%.  To
give you an idea, I receive around 50 hams per day, and around 300 spams.
I get on average one false negative a day, and yesterday I had my first
false positive (a confirmation for registering a .NET Passport; don't get
me started on Microsoft's support policies) in at least a month.
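
As a rough check on the "well over 99%" figure, the daily counts above can
be turned into a percentage.  This is a back-of-the-envelope sketch, not a
measurement: it assumes a 30-day month for the false positive rate and
counts only outright misclassifications (unsures are ignored).  The
variable names are purely illustrative.

```python
# Approximate daily figures quoted in the message.
hams_per_day = 50
spams_per_day = 300
false_negatives_per_day = 1        # spam that was delivered as ham
false_positives_per_day = 1 / 30   # assumption: one per 30-day month

total = hams_per_day + spams_per_day
errors = false_negatives_per_day + false_positives_per_day

# Share of all mail classified correctly.
hit_rate = 100.0 * (total - errors) / total
# Share of spam that was caught.
spam_catch_rate = 100.0 * (spams_per_day - false_negatives_per_day) / spams_per_day

print(f"overall hit rate: {hit_rate:.2f}%")        # 99.70%
print(f"spam catch rate:  {spam_catch_rate:.2f}%") # 99.67%
```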

My database is 1292 spams and 646 hams.  I initially trained with a couple
of hundred of each, and have been mistake-based training ever since
(training on spams classified as ham or unsure, and on hams classified as
spam - though they're vanishingly rare, hence the imbalance).
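
The mistake-based scheme described above can be sketched roughly as
follows.  This is only an illustration of the idea, not Spambayes code:
`classify` and `train` are hypothetical callables standing in for whatever
classifier you use, and the verdict strings "ham", "spam" and "unsure" are
assumed for the example.

```python
def mistake_based_train(messages, classify, train):
    """Train only on the mistakes described in the message.

    messages: iterable of (text, is_spam) pairs with known labels.
    classify: hypothetical callable returning "ham", "spam" or "unsure".
    train:    hypothetical callable taking (text, is_spam).
    Returns the number of messages trained on.
    """
    trained = 0
    for text, is_spam in messages:
        verdict = classify(text)
        if is_spam:
            # Spam classified as ham or unsure is a mistake.
            mistake = verdict in ("ham", "unsure")
        else:
            # Ham classified as spam is a mistake.
            mistake = verdict == "spam"
        if mistake:
            train(text, is_spam)
            trained += 1
    return trained


# Stub classifier that calls everything ham, to show the flow.
seen = []
count = mistake_based_train(
    [("cheap pills", True), ("lunch at noon?", False)],
    classify=lambda text: "ham",
    train=lambda text, is_spam: seen.append((text, is_spam)),
)
print(count, seen)
```

With the stub, only the spam message is trained on, because the ham was
already classified correctly.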

There are enough people saying that in their experience an even ham/spam
balance improves results that I'm thinking I ought to train on some more
ham to redress the balance.

People have also reported both ways on whether you should train on
hundreds or tens of thousands of messages.  Do the easier one; if it's
unsatisfactory, do the harder one; if that gets worse, start again with
the easier one.  There are no rules.

> I have copies of every email it was trained
> on, so a retrain would simply be a matter of spending half an hour or
> so writing a script.

You shouldn't need to write a script - if you can get your messages into
an mbox file or a Maildir directory, you can train either through the web
interface or by using sb_mboxtrain.py or sb_filter.py.
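
If the saved copies are individual files rather than an mbox, something
like the following could gather them into one.  It is a minimal sketch
using the `mailbox` module from a current Python standard library, and it
assumes one raw message per file in a single directory; `build_mbox` and
the paths passed to it are made up for the example.

```python
import mailbox
from pathlib import Path


def build_mbox(message_dir, mbox_path):
    """Append every file in message_dir to the mbox at mbox_path.

    Assumes each file holds one raw email message and that no other
    program is writing to the mbox at the same time.
    Returns the number of messages added.
    """
    box = mailbox.mbox(str(mbox_path))  # created if it does not exist
    count = 0
    try:
        for path in sorted(Path(message_dir).iterdir()):
            if path.is_file():
                box.add(path.read_bytes())  # raw bytes are accepted
                count += 1
        box.flush()  # make sure everything is written to disk
    finally:
        box.close()
    return count
```

Run once over the ham copies and once over the spam copies to get two
mbox files, which can then be handed to the training tools mentioned
above; check each script's own usage text for the exact options.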

