[Spambayes] Bug in classifier.py

Thu Oct 16 18:05:39 EDT 2003

> I have just downloaded version 1.0a6.1, and after some 
> tinkering got it to work properly with procmail. However, 
> every time I get a "ham" message, the sb_filter script 
> generates a traceback:
[...]
>   File "/u/bjorn/src/spambayes/spambayes/classifier.py", line 
> 243, in probability
>     assert hamcount <= nham

This means that the database is no longer consistent, i.e. that a token
("content-type:text/plain") has been counted in more ham messages than the
total number of ham messages you have trained on.
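
To make the invariant behind that assert concrete, here is a minimal sketch
of a consistency check. It is a toy model, not the real classifier.py data
structures: the function name find_inconsistent_tokens and the dict layout
(token -> (spamcount, hamcount)) are assumptions made for illustration only.

```python
# Toy model only: the names and layout are hypothetical, not the
# actual structures used by spambayes/classifier.py.
def find_inconsistent_tokens(nham, nspam, wordinfo):
    """Return tokens counted in more messages than were trained.

    wordinfo maps token -> (spamcount, hamcount).
    """
    bad = []
    for token, (spamcount, hamcount) in wordinfo.items():
        # A token is counted at most once per message, so its count can
        # never legitimately exceed the number of trained messages.
        if hamcount > nham or spamcount > nspam:
            bad.append(token)
    return sorted(bad)


wordinfo = {
    "content-type:text/plain": (1, 4),  # 4 ham hits, but only 3 ham trained
    "subject:hello": (0, 2),
}
inconsistent = find_inconsistent_tokens(nham=3, nspam=2, wordinfo=wordinfo)
print(inconsistent)
```

Any token this reports would trip the same kind of assertion as the one in
the traceback.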

> Some debug printouts later, it seems that the token 
> 'content-type:text/plain' is the bad guy - one of the 'ham' 
> messages I used for training was a bounce message, which 
> included the original email headers in the message body. The 
> comments in classifier.py seem to indicate that this isn't a 
> new issue either.

Do you mean that if you train on this message you get the
"content-type:text/plain" token incremented twice, and not once?  A token
should only ever be incremented once per message, no matter how many times
it appears.  If this is the case, then it's a serious bug.
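
As a rough illustration of "once per message", the sketch below collapses a
message's tokens into a set before incrementing any counts. The helper
train_ham and the Counter-based store are hypothetical stand-ins, not the
actual training code; the token list imitates a bounce message that repeats
the original headers in its body.

```python
from collections import Counter


def train_ham(tokens, ham_counts):
    """Increment each distinct token once, however often it repeats."""
    for token in set(tokens):
        ham_counts[token] += 1


ham_counts = Counter()
nham = 0

# The quoted headers in the body produce the same token a second time.
bounce_tokens = [
    "content-type:text/plain",
    "subject:undeliverable",
    "content-type:text/plain",
]
train_ham(bounce_tokens, ham_counts)
nham += 1

# With de-duplication the invariant from the traceback still holds.
assert ham_counts["content-type:text/plain"] <= nham
```

If the counts were incremented straight from the raw token list instead, the
repeated token would reach 2 after a single ham message and the assertion
would fail.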

In any case, the solution is either to retrain from scratch or to flatten
your database to a text file (with sb_dbexpimp.py), correct the totals at
the top, and then unflatten it again.

=Tony Meyer