[spambayes-dev] [Spambayes] ZeroDivisionError with hammie.score()
tim.peters at gmail.com
Sat Jul 15 00:15:48 CEST 2006
> With the definitions of spamcount and hamcount it makes sense that
> they might be zero, since there is minimal training data in the
> system, and the word being scored does not exist in the database.
> This might be some sort of small bug with running the filter on a
> small amount of data, as I can reliably replicate a divide by zero
> error. If spamcount and hamcount are both zero, shouldn't the system
> return some sort of 0% probability for spam or ham (showing its
> uncertainty for the phrase being scored)?
Yes, and it does. That's what Kenny tried to tell you :-) This is
the relevant code:
    def _worddistanceget(self, word):
        record = self._wordinfoget(word)
        if record is None:
            prob = options["Classifier", "unknown_word_prob"]
        else:
            prob = self.probability(record)
        distance = abs(prob - 0.5)
        return distance, prob, word, record
If there is no record for the word, then this returns the value of the
"unknown_word_prob" option. It only tries to _compute_ the
probability if there _is_ a record for the word, and it should never
be the case that a record exists for a word with hamcount and
spamcount both 0.
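To make that lookup path concrete, here is a minimal standalone sketch. It is not the SpamBayes source: `ToyClassifier`, `WordInfo` and `UNKNOWN_WORD_PROB` are hypothetical stand-ins, 0.5 is only an assumed value for the option, and `probability()` is reduced to the bare ratio quoted in this thread.

```python
# Hypothetical toy model of the lookup path; only the method names
# _worddistanceget, _wordinfoget and probability come from the thread.
UNKNOWN_WORD_PROB = 0.5  # assumed value of the "unknown_word_prob" option


class WordInfo:
    """Per-word training counts."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount


class ToyClassifier:
    def __init__(self, nspam, nham):
        self.nspam = nspam      # number of spam messages trained on
        self.nham = nham        # number of ham messages trained on
        self.wordinfo = {}      # word -> WordInfo

    def _wordinfoget(self, word):
        # Returns None when the word has never been trained on.
        return self.wordinfo.get(word)

    def probability(self, record):
        # Simplified: just the ratio from the thread, with no further
        # adjustment. Divides by zero if both counts are 0.
        spamratio = record.spamcount / float(self.nspam or 1)
        hamratio = record.hamcount / float(self.nham or 1)
        return spamratio / (hamratio + spamratio)

    def _worddistanceget(self, word):
        record = self._wordinfoget(word)
        if record is None:
            prob = UNKNOWN_WORD_PROB
        else:
            prob = self.probability(record)
        distance = abs(prob - 0.5)
        return distance, prob, word, record
```

In this sketch an untrained word never reaches `probability()` at all; the division only fails if a record with both counts at 0 has somehow been stored.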
It would be helpful to dump print statements into that function (or
run under Python's debugger) to see exactly which word it is and
what's in that record -- or possibly you'd discover that
_worddistanceget() isn't being called at all. You didn't include a
complete traceback in your original message, so it's impossible from
here to guess who called probability() to begin with. A complete
traceback would help.
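One way to gather that information without editing the library is to wrap the method on the classifier instance. The helper below is only a sketch: `trace_probability` is a hypothetical name, and it assumes the record exposes `spamcount` and `hamcount` attributes.

```python
import traceback


def trace_probability(classifier, log=print):
    """Wrap classifier.probability so each call reports its record and caller.

    Hypothetical debugging helper, not part of SpamBayes.
    """
    original = classifier.probability

    def traced(record):
        log("probability() called with spamcount=%r hamcount=%r" % (
            getattr(record, "spamcount", None),
            getattr(record, "hamcount", None)))
        # Show the call chain that led here; the last frame is this
        # wrapper itself, so drop it.
        log("".join(traceback.format_stack()[:-1]))
        return original(record)

    # Shadow the method on this one instance only.
    classifier.probability = traced
    return classifier
```

Calling `trace_probability(classifier)` once before scoring should show which record triggers the error and which code path called `probability()`.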
> If I change line 320 of classify.py (I'm using the latest 1.1a1 release
> now) to a very simple try/except clause:
>     try:
>         prob = spamratio / (hamratio + spamratio)
>     except ZeroDivisionError:
>         prob = 0
> You can't replicate the error with the above script.
> Is this a patch that should be submitted?
No, because that slows down a speed-critical function to paper over a
problem that should never occur. The bug isn't that this is dividing
by 0, the bug is that probability() is being _called_ when both counts
are 0. Something, somewhere, on the path _toward_ calling
probability() is in error.
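A quick check in that direction is to scan the trained word store for records that should not exist. This is a sketch under assumptions: it treats the store as a plain mapping from word to a record with `spamcount` and `hamcount` attributes, and `Record` and `find_empty_records` are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a stored word record.
Record = namedtuple("Record", "spamcount hamcount")


def find_empty_records(wordinfo):
    """Return, sorted, the words whose record has both counts equal to 0."""
    return sorted(word for word, record in wordinfo.items()
                  if record.spamcount == 0 and record.hamcount == 0)
```

If this returns anything for a real database, the next question is which training or untraining step stored a record with both counts at 0.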