[Spambayes] The joys of error messages.

Tony Meyer tameyer at ihug.co.nz
Tue May 11 22:54:18 EDT 2004

> How come the answer to a corrupt database is always "retrain 
> from scratch"?

(a) Retraining is a trivial task, and it takes only a few messages to get
good accuracy again.  Assuming that people keep a little bit of mail around
(which we recommend), then retraining is very quick.  It's not like you lose
important information.

(b) There isn't any way (that we know of) to recover the database.  If you
could, then this would obviously be better.

> I know it's not spambayes' job to make 
> backups, but a simple weekly backup for people who don't 
> normally make backups wouldn't be too hard to fit into the 
> program, would it?

Some people (particularly *nix people) believe that a program should do only
one thing, and do that well.  Turning SpamBayes into a backup program isn't
really necessary.  If people want to back up their database, it's a trivial
thing to do, and should easily fit into their existing backup system.

An additional problem is that the DB_RUN_RECOVERY error doesn't necessarily
occur the very first time after the db is corrupted.  Testing shows that it
can be several accesses later.  This means that you can't simply use the
last good db; you'd have to keep a whole lot of them.
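
For anyone who does want to roll their own, here is a minimal sketch of what
"a whole lot of them" could look like: a rotating copy that keeps the last few
generations.  This is not part of SpamBayes; backup_db, its arguments, and the
file naming are all made up for illustration.  It assumes no process has the
database open while the copy is made and that the backup directory holds
nothing else.

```python
import shutil
import time
from pathlib import Path


def backup_db(db_path, backup_dir, keep=10):
    """Copy db_path into backup_dir and keep only the newest `keep` copies.

    Hypothetical helper, not a SpamBayes API.  Copies are named
    <db name>.<6-digit sequence>.<timestamp>, so sorting by name
    gives oldest-first order.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    db_path = Path(db_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    prefix = db_path.name + "."
    existing = sorted(p for p in backup_dir.iterdir()
                      if p.name.startswith(prefix))
    if existing:
        # Continue numbering from the newest copy already present.
        next_seq = int(existing[-1].name[len(prefix):].split(".")[0]) + 1
    else:
        next_seq = 0

    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{prefix}{next_seq:06d}.{stamp}"
    shutil.copy2(db_path, target)

    # Drop the oldest copies beyond the retention limit.
    for old in (existing + [target])[:-keep]:
        old.unlink()
    return target
```

Since the error can show up several accesses after the damage is done, the
newest copy may already be bad, which is the reason for keeping several
generations rather than one.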

Anyway, anyone who doesn't normally make backups either doesn't care about
their data, is naïve, or is stupid.  Sooner or later, data gets lost.

> I mean losing the email, the vacation pictures and all the 
> work is one thing, having to retrain spambayes... that would 
> be horrible!

Why?  It takes just a few moments to do, assuming you have mail sitting
around to train it on.  If not, then you just have a day or two when you get
a lot of mail in your unsure folder and have to be a bit more careful
looking for false positives/negatives.

> I don't know how common these corrupt databases are. Maybe 
> it's not an issue.

They are a lot less common than they once were.  We've managed to eliminate
the majority of the causes.  It appears that the main reasons left are
people doing things like having two processes access the db at once (not
supported), or killing a process in the middle of training.  It's difficult
to avoid corruption in those circumstances, but they should be rare.
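
The two-processes case is the sort of thing a wrapper script can guard
against.  As a rough sketch (this is not something SpamBayes does itself, and
exclusive_access and DatabaseBusy are names invented here), a lock file
created with O_EXCL lets a second process notice that the db is in use and
back off:

```python
import os
from contextlib import contextmanager


class DatabaseBusy(Exception):
    """Raised when another process appears to hold the lock (hypothetical)."""


@contextmanager
def exclusive_access(db_path):
    """Hold <db_path>.lock for the duration of the with-block."""
    lock_path = str(db_path) + ".lock"
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # process can create the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise DatabaseBusy(f"{lock_path} exists; is another process running?")
    try:
        # Record our pid so a human can check who holds the lock.
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        yield lock_path
    finally:
        os.remove(lock_path)
```

This does nothing for the other case: a process killed mid-training leaves
both a possibly damaged db and a stale lock file that has to be removed by
hand.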

Our goal, really (mine, at least), is to eliminate the remaining rare cases
of corruption, rather than force a backup system on everyone.  Of course,
this *is* open-source - you (or anyone else) are free to patch one in.

The best solution (it seems) is to move away from the bsddb-based dbm that
is currently the default, or to move to a proper transactional system using
that dbm (which would allow recovery).  This is a lot of work, though, and
ideally requires someone with lots of experience working with databases.
For the moment, there are other options - if you run from source, you can
use a pickle, mysql, pgsql, or (kinda) zeo.  None of these will give you the
same corruption problem.
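
To illustrate why a single-file store like a pickle is easier to keep intact,
here is a sketch of the usual write-to-a-temp-file-then-rename pattern.  It is
an illustration only, not a description of how the SpamBayes pickle storage is
written; save_atomically and load are hypothetical names.

```python
import os
import pickle
import tempfile


def save_atomically(obj, path):
    """Pickle obj to path so that path always holds a complete file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so the rename below
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the new file in as a single step.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load(path):
    """Read back an object written by save_atomically."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

If the process is killed partway through the dump, the old file is still
there untouched; at worst a stray .tmp file is left behind.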

=Tony Meyer

Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.
