[spambayes-dev] [Spambayes] ZeroDivisionError with hammie.score()

Fri Jul 14 23:18:22 CEST 2006

Kenny,
Thanks for the reply.

With the definitions of spamcount and hamcount it makes sense that
they might be zero, since there is minimal training data in the
system, and the word being scored does not exist in the database.

This might be some sort of small bug with running the filter on a
small amount of data, as I can reliably replicate a divide by zero
error.  If spamcount and hamcount are both zero, shouldn't the system
return some sort of 0% probability for spam or ham (showing its
uncertainty for the phrase being scored)?
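
To make the failure concrete, here is a minimal standalone sketch of the
ratio arithmetic.  It only mirrors the expression reported in the
traceback quoted below and is not the SpamBayes source; the function name
naive_spam_prob and the "or 1" guards on nham/nspam are assumptions made
for illustration.

```python
def naive_spam_prob(spamcount, hamcount, nspam, nham):
    """Hypothetical stand-in for the per-word probability calculation."""
    # Fraction of trained ham / spam messages that contain the word.
    # The "or 1" guard is an assumption so that an untrained corpus
    # does not itself divide by zero.
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    # When both counts are 0, both ratios are 0.0 and this line raises
    # ZeroDivisionError.
    return spamratio / (hamratio + spamratio)


if __name__ == "__main__":
    print(naive_spam_prob(6, 6, 6, 6))   # word seen equally in ham and spam
    try:
        naive_spam_prob(0, 0, 6, 6)      # word with both counts at zero
    except ZeroDivisionError as exc:
        print("ZeroDivisionError: %s" % exc)
```

With the counts from my print statements (nham 6, nspam 6), any word whose
spamcount and hamcount are both 0 takes the failing path.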

Here is a script which trains one phrase as ham and one phrase as
spam, then tries to filter a phrase containing a number of words which
don't exist in the system.  (I didn't include my pgsql connection
details, but it's running on the pgsql connector if that matters)

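#!/usr/bin/python
from spambayes import hammie

# dbinfo and dbtype hold the pgsql connection details (omitted here)
h = hammie.open(dbinfo, dbtype, 'w')
h.train_ham('here are some pictures from our trip to africa, i hope you enjoy them')
h.store()
h.train_spam('refinance your mortgage with cilias!')
h.store()
# most of these words were never trained; this call raises ZeroDivisionError
h.filter('do you want some viagra')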

It seems the exception simply isn't being caught.  You should be able
to score text when the database has little or no information about
what is spam and what is ham; the classifier should just report that
it is unsure.

If I change line 320 of classifier.py (I'm using the latest 1.1a1
release now) to a very simple try/except clause:
        try:
          prob = spamratio / (hamratio + spamratio)
        except ZeroDivisionError:
          prob = 0

then the error no longer occurs with the above script.
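
An alternative to catching the exception would be to test for the zero
denominator explicitly and return a neutral value.  The sketch below is
standalone and is not a patch against classifier.py; the function name
guarded_spam_prob and the choice of 0.5 as the "unsure" value are
assumptions for illustration (the try/except above uses 0 instead).

```python
UNSURE_PROB = 0.5  # assumed neutral value; not taken from the SpamBayes options


def guarded_spam_prob(spamcount, hamcount, nspam, nham, unsure=UNSURE_PROB):
    """Hypothetical per-word probability that tolerates zero counts."""
    # Fraction of trained ham / spam messages that contain the word.
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    total = hamratio + spamratio
    if total == 0:
        # No evidence either way, so report the neutral value
        # rather than dividing by zero.
        return unsure
    return spamratio / total


if __name__ == "__main__":
    print(guarded_spam_prob(0, 0, 6, 6))  # both counts zero
    print(guarded_spam_prob(6, 6, 6, 6))  # seen equally in ham and spam
```

Checking the denominator keeps unrelated errors visible, which a broad
except clause would hide.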

Is this a patch that should be submitted?  Is there a method for
submitting this?

Thanks!
Todd

On 7/14/06, Kenny Pitt <kenny.pitt at gmail.com> wrote:
> [I'm moving this over to spambayes-dev because it deals more with the code]
>
> On 7/13/06, Todd Kennedy <todd.kennedy at gmail.com> wrote:
> > I'm trying to integrate the spambayes package into my blogging
> > software as a comment spam filter.  I've read through a bunch of the
> > source, looked at the scripts provided, and have a
> > rudimentary understanding of how the software works (I think).  But
> > I'm getting a ZeroDivisionError when I try to run the score method of
> > hammie.
> >
> > [...]
> >
> > The exception occurs at:
> >  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py",
> > line 320, in probability
> >    prob = spamratio / (hamratio + spamratio)
> > ZeroDivisionError: float division
> >
> > I put in some simple print statements to print out nham, nspam,
> > spamcount and hamcount.  This is their output:
> > 22:14:52 (~)
> > todd at mothra> ./test_sp.py
> > spamcount 6
> > hamcount 6
> > nham 6
> > nspam 6
> > spamcount 6
> > hamcount 6
> > spamcount 6
> > hamcount 6
> > spamcount 6
> > hamcount 6
> > spamcount 0
> > hamcount 0
> > nham 6
> > nspam 6
> >
> > Why would spamcount and hamcount go to 0?
>
> From the WordInfo class comments in classifier.py:
>
>     # ... spamcount is the
>     # number of trained spam msgs in which the word appears, and hamcount
>     # the number of trained ham msgs.
>
> So spamcount would be 0 if the current word has never been seen in a
> trained spam message, and similarly for hamcount. A word will only
> appear in the training database if it has appeared in at least one
> message, so you should never have a word with both counts 0. The
> _worddistanceget() function in the Classifier class deals with this by
> assigning a default probability to any word that does not appear in
> the training data, so the probability calculation should only run on
> trained words.
>
> It's hard to say how the code might have ended up in the probability()
> function with a word that wasn't in the training data. It might help
> to print which word produced each of the spamcount/hamcount pairs and
> compare those against the training data to see if there are any that
> don't appear in the training.
>
> It would also be interesting to know if you have ever tried to remove
> a message from the training data (i.e. untrain the message). When a
> message is removed, each word is checked to see if both counts have
> gone to 0 (see the _remove_msg function) and the word should be
> removed from the training data in that case. I see that you are using
> the Postgres storage engine. I'm guessing a little here, but I don't
> think Postgres has received as much testing as some of the other
> storage formats so it might be possible that the record didn't
> actually get deleted from the training database once both counts went
> to 0.
>
> --
> Kenny Pitt
>