[spambayes-dev] [Spambayes] ZeroDivisionError with hammie.score()

Todd Kennedy todd.kennedy at gmail.com
Sat Jul 15 23:56:19 CEST 2006


Tim,

Thanks for the reply.  I understand what you're talking about with
papering over the problem.

I've included the full traceback that you get when you run the script
I provided.  Hopefully this will provide some information.  Any ideas
on how to resolve this would be great -- I'm moderately new to Python.
Also, I upgraded to 1.1a2 and it's still occurring...

17:53:27 (~/src/spambayes)
todd at mothra> ./test.py
Traceback (most recent call last):
  File "./test.py", line 9, in ?
    h.filter('do you want some viagra')
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/hammie.py", line 155, in filter
    debug, train)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/hammie.py", line 109, in score_and_filter
    prob, clues = self._scoremsg(msg, True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/hammie.py", line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 196, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 499, in _getclues
    tup = self._worddistanceget(word)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 514, in _worddistanceget
    prob = self.probability(record)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 320, in probability
    prob = spamratio / (hamratio + spamratio)
ZeroDivisionError: float division
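
For what it's worth, the failing division can be reproduced outside
SpamBayes.  The sketch below is a simplified stand-in, not the library's
code: ratio_probability and its parameters are hypothetical names, and
only the final division is taken from the traceback above.  It assumes
each ratio is the word's count divided by the number of trained messages
of that kind.

```python
def ratio_probability(spamcount, hamcount, nspam, nham):
    """Hypothetical stand-in for the division on line 320 of classifier.py.

    Assumption: each ratio is the word's count divided by the number of
    trained messages of that kind (falling back to 1 to avoid a zero
    denominator there).
    """
    spamratio = spamcount / float(nspam or 1)
    hamratio = hamcount / float(nham or 1)
    # This is the expression shown in the traceback.  If both counts are
    # zero, both ratios are 0.0 and the denominator is 0.0.
    return spamratio / (hamratio + spamratio)


def both_counts_zero_fails():
    """Return True if a record with both counts at zero raises the error."""
    try:
        ratio_probability(0, 0, 1, 1)
    except ZeroDivisionError:
        return True
    return False
```

With either count nonzero the denominator is positive, so under these
assumptions the error only appears for a record whose counts are both
zero.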

On 7/14/06, Tim Peters <tim.peters at gmail.com> wrote:
> [Todd Kennedy]
> > With the definitions of spamcount and hamcount it makes sense that
> > they might be zero, since there is minimal training data in the
> > system, and the word being scored does not exist in the database.
> >
> > This might be some sort of small bug with running the filter on a
> > small amount of data, as I can reliably replicate a divide by zero
> > error.  If spamcount and hamcount are both zero, shouldn't the system
> > return some sort of 0% probability for spam or ham (showing its
> > uncertainty for the phrase being scored)?
>
> Yes, and it does.  That's what Kenny tried to tell you :-)  This is
> Classifier._worddistanceget():
>
>     def _worddistanceget(self, word):
>         record = self._wordinfoget(word)
>         if record is None:
>             prob = options["Classifier", "unknown_word_prob"]
>         else:
>             prob = self.probability(record)
>         distance = abs(prob - 0.5)
>         return distance, prob, word, record
>
> If there is no record for the word, then this returns the value of the
> "unknown_word_prob" option.  It only tries to _compute_ the
> probability if there _is_ a record for the word, and it should never
> be the case that a record exists for a word with hamcount and
> spamcount both 0.
>
> It would be helpful to dump print statements into that function (or
> run under Python's debugger) to see exactly which word it is and
> what's in that record -- or possibly you'd discover that
> _worddistanceget() isn't being called at all.  You didn't include a
> complete traceback in your original message, so it's impossible from
> here to guess who called probability() to begin with.  A complete
> traceback would help.
>
> > ...
> > If I change line 320 of classifier.py (I'm using the latest 1.1a1 release
> > now) to a very simple try/except clause:
> >         try:
> >             prob = spamratio / (hamratio + spamratio)
> >         except ZeroDivisionError:
> >             prob = 0
> >
> > You can't replicate the error with the above script.
> >
> > Is this a patch that should be submitted?
>
> No, because that slows down a speed-critical function to paper over a
> problem that should never occur.  The bug isn't that this is dividing
> by 0, the bug is that probability() is being _called_ when both counts
> are 0.  Something, somewhere, on the path _toward_ calling
> probability() is in error.
>
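
One way to act on the suggestion to look at what's in the record is to
scan the word database for entries whose counts are both zero.  This is
only a sketch: Record, find_empty_records, and the sample data are
hypothetical stand-ins, and the one assumption taken from this thread is
that a record carries spamcount and hamcount.

```python
class Record(object):
    """Hypothetical stand-in for a per-word record."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount


def find_empty_records(wordinfo):
    """Return the sorted words whose record has both counts at zero.

    `wordinfo` is assumed to be a mapping of word -> record.
    """
    return sorted(
        word
        for word, record in wordinfo.items()
        if record.spamcount == 0 and record.hamcount == 0
    )


# Made-up sample data, for illustration only.
sample = {
    "viagra": Record(spamcount=3, hamcount=0),
    "want": Record(spamcount=0, hamcount=0),  # the kind of record that should not exist
    "some": Record(spamcount=1, hamcount=2),
}
suspects = find_empty_records(sample)
print(suspects)
```

Any word reported this way would be a candidate for the record that
reaches probability() with nothing to divide by.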

