[Spambayes] Bug in classifier.py

Björn Sandberg bjorn at strakt.com
Wed Oct 22 05:20:16 EDT 2003


On Fri, 17 Oct 2003, Tony Meyer wrote:

> > I have just downloaded version 1.0a6.1, and after some
> > tinkering got it to work properly with procmail. However,
> > every time I get a "ham" message, the sb_filter script
> > generates a traceback:
> [...]
> >   File "/u/bjorn/src/spambayes/spambayes/classifier.py", line
> > 243, in probability
> >     assert hamcount <= nham
>
> This means that the database is no longer good, i.e. that you have seen a
> ham token - "content-type:text/plain", in more ham messages than you have
> trained on.

[snip]

> In any case, the solution is to either retrain from scratch, or to flatten
> your database to a text file (sb_dbexpimp.py), fix the totals at the top,
> and then unflatten it again.
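
As a rough illustration of the check behind that assertion, the sketch
below scans per-token counts for entries that exceed the message totals.
It is not SpamBayes code and does not read the sb_dbexpimp.py export
format; the function name and the plain dict layout are assumptions made
for the example.

```python
def find_inconsistent_tokens(nham, nspam, token_counts):
    """Return the tokens whose counts exceed the trained-message totals.

    token_counts is a hypothetical mapping: token -> (hamcount, spamcount).
    A consistent database satisfies hamcount <= nham and spamcount <= nspam
    for every token.
    """
    return sorted(
        token
        for token, (hamcount, spamcount) in token_counts.items()
        if hamcount > nham or spamcount > nspam
    )


# Toy data: two ham messages trained, yet one token was seen in three.
bad_tokens = find_inconsistent_tokens(
    2, 1, {"content-type:text/plain": (3, 0), "subject:hello": (1, 1)}
)
```

Any token reported this way indicates that the totals are too low or that
the token was counted more often than messages were trained.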

Things are working - sort of. After rebuilding my database from scratch,
sb_filter is working nicely. However, I have been forced to stop the
automatic updating of the database, since chances are that mboxtrain will
fail.

I'm running SpamBayes 1.0a6.1 with procmail doing the actual mail sorting.
Spam goes into a normal 'mbox' file; other mail uses MH folders.

I have a cron job that runs in the early morning to train on new incoming
mail. This appears to set the X-Spambayes-Trained header as it's supposed
to.

Whenever I find mail that has been misfiled, I move it to the 'spam'
folder, expecting sb_mboxtrain.py to retrain it the next morning.
However, my impression is that the X-Spambayes-Trained header is left
unmodified during the retraining! This causes problems later on, since
sb_mboxtrain.py then attempts to 'untrain' the same message _twice_,
triggering a traceback.
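
A minimal sketch of the behaviour expected here, assuming the header
holds the label "ham" or "spam": when a message is retrained, the old
label is untrained once and the header is rewritten, so a later run sees
nothing left to do. This is not the actual sb_mboxtrain.py code;
ToyCounts and train_message are hypothetical names, and only message
totals are tracked.

```python
from email.message import Message

TRAINED_HDR = "X-Spambayes-Trained"


class ToyCounts:
    """Hypothetical stand-in for the classifier: message totals only."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0

    def learn(self, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, is_spam):
        current = self.nspam if is_spam else self.nham
        if current == 0:
            raise ValueError("untraining would make a total negative")
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


def train_message(counts, msg, is_spam):
    """Train msg under the given label; return True if the counts changed."""
    new_label = "spam" if is_spam else "ham"
    old_label = msg.get(TRAINED_HDR)
    if old_label == new_label:
        return False  # already trained this way, nothing to do
    if old_label is not None:
        counts.unlearn(old_label == "spam")  # undo the earlier training once
    counts.learn(is_spam)
    del msg[TRAINED_HDR]  # no error if the header is absent
    msg[TRAINED_HDR] = new_label  # record the new state on the message
    return True


counts = ToyCounts()
msg = Message()
train_message(counts, msg, is_spam=False)  # nightly run files it as ham
train_message(counts, msg, is_spam=True)   # moved to 'spam', retrained
repeated = train_message(counts, msg, is_spam=True)  # next night: no-op
```

Because the header is rewritten as part of the retraining step, the third
call changes nothing instead of untraining the message a second time.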

This would not be such a big problem if sb_mboxtrain.py handled the
database in a gentle manner. The problem is that it doesn't - any failure
during the retraining procedure leaves the database in an inconsistent
state! This is what caused the earlier traceback: sb_filter is doing its
best, but the problem lies in the working data.
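
One way to get the gentler handling asked for is to apply a whole
training run to a copy of the counts and keep the copy only if every step
succeeds. The sketch below shows that idea on a plain dict; it is an
assumption about how such a fix could look, not how SpamBayes stores its
data, and apply_batch is a hypothetical name.

```python
import copy


def apply_batch(counts, batch):
    """Apply (token, delta) pairs to a copy of counts and return the copy.

    Raises ValueError if any count would become negative. The caller's
    original mapping is never modified, so a failed run leaves it intact.
    """
    working = copy.deepcopy(counts)
    for token, delta in batch:
        new_value = working.get(token, 0) + delta
        if new_value < 0:
            raise ValueError(f"count for {token!r} would become {new_value}")
        working[token] = new_value
    return working


db = {"content-type:text/plain": 1}
try:
    # Untraining the same message twice fails on the second step.
    db = apply_batch(db, [("content-type:text/plain", -1),
                          ("content-type:text/plain", -1)])
except ValueError:
    pass  # db still holds the last consistent state
```

With an on-disk database the same effect could be had by training on a
temporary copy of the file and renaming it over the original only after
the run completes.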

I hope this gives you enough information to work with. To solve this
problem, you should preferably fix both bugs - the header issue as well as
gentler handling of the database info.

// Bjorn


