Re: [spambayes-dev] [Spambayes] ZeroDivisionError with hammie.score()
[I'm moving this over to spambayes-dev because it deals more with the code]

On 7/13/06, Todd Kennedy <todd.kennedy@gmail.com> wrote:
I'm trying to integrate the spambayes package into my blogging software as a comment spam filter. I've read through a bunch of the source, looked at the scripts provided and have a rudimentary understanding of how the software works (I think). But I'm getting a ZeroDivisionError when I try to run the score method of hammie.
[...]
The exception occurs at:

  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 320, in probability
    prob = spamratio / (hamratio + spamratio)
ZeroDivisionError: float division

I put in some simple print statements to print out nham, nspam, spamcount and hamcount. This is their output:

22:14:52 (~) todd@mothra> ./test_sp.py
spamcount 6 hamcount 6
nham 6 nspam 6
spamcount 6 hamcount 6
spamcount 6 hamcount 6
spamcount 6 hamcount 6
spamcount 0 hamcount 0
nham 6 nspam 6
why would spamcount and hamcount go to 0?
From the WordInfo class comments in classifier.py:
    # ... spamcount is the
    # number of trained spam msgs in which the word appears, and hamcount
    # the number of trained ham msgs.

So spamcount would be 0 if the current word has never been seen in a trained spam message, and similarly for hamcount. A word will only appear in the training database if it has appeared in at least one message, so you should never have a word with both counts 0. The _worddistanceget() function in the Classifier class deals with this by assigning a default probability to any word that does not appear in the training data, so the probability calculation should only run on trained words.

It's hard to say how the code might have ended up in the probability() function with a word that wasn't in the training data. It might help to print which word produced each of the spamcount/hamcount pairs and compare those against the training data to see if there are any that don't appear in the training.

It would also be interesting to know if you have ever tried to remove a message from the training data (i.e. untrain the message). When a message is removed, each word is checked to see if both counts have gone to 0 (see the _remove_msg function) and the word should be removed from the training data in that case. I see that you are using the Postgres storage engine. I'm guessing a little here, but I don't think Postgres has received as much testing as some of the other storage formats, so it might be possible that the record didn't actually get deleted from the training database once both counts went to 0.

-- Kenny Pitt
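The invariant Kenny describes (a word record exists only while at least one of its counts is non-zero) can be illustrated with a small stand-alone sketch. This is a toy model under stated assumptions, not spambayes code: ToyStore, train(), untrain() and zero_count_words() are hypothetical names, and a plain dict stands in for the training database.

```python
# Toy model of a training store; hypothetical names, not the spambayes API.
class WordInfo:
    """Per-word counts of trained spam and ham messages containing the word."""

    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0


class ToyStore:
    def __init__(self):
        self.wordinfo = {}  # word -> WordInfo

    def train(self, words, is_spam):
        # Each distinct word in a message counts once for that message.
        for word in set(words):
            record = self.wordinfo.setdefault(word, WordInfo())
            if is_spam:
                record.spamcount += 1
            else:
                record.hamcount += 1

    def untrain(self, words, is_spam):
        for word in set(words):
            record = self.wordinfo.get(word)
            if record is None:
                continue
            if is_spam and record.spamcount > 0:
                record.spamcount -= 1
            elif not is_spam and record.hamcount > 0:
                record.hamcount -= 1
            # Drop the record once both counts reach 0, so that a
            # zero/zero record never stays behind in the store.
            if record.spamcount == 0 and record.hamcount == 0:
                del self.wordinfo[word]

    def zero_count_words(self):
        """Diagnostic: words whose record violates the invariant."""
        return sorted(w for w, r in self.wordinfo.items()
                      if r.spamcount == 0 and r.hamcount == 0)


store = ToyStore()
store.train(["refinance", "mortgage"], is_spam=True)
store.train(["pictures", "mortgage"], is_spam=False)
store.untrain(["refinance", "mortgage"], is_spam=True)

print(sorted(store.wordinfo))    # ['mortgage', 'pictures']
print(store.zero_count_words())  # []
```

A check like zero_count_words() is one way to act on the suggestion above: if a real store ever reported a word with both counts at 0, that record would be the first thing to investigate.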
Kenny,

Thanks for the reply.

With the definitions of spamcount and hamcount it makes sense that they might be zero, since there is minimal training data in the system, and the word being scored does not exist in the database.

This might be some sort of small bug with running the filter on a small amount of data, as I can reliably replicate a divide by zero error. If spamcount and hamcount are both zero, shouldn't the system return some sort of 0% probability for spam or ham (showing its uncertainty for the phrase being scored)?

Here is a script which trains one phrase as ham and one phrase as spam, then tries to filter a phrase containing a number of words which don't exist in the system. (I didn't include my pgsql connection details, but it's running on the pgsql connector if that matters.)

    #!/usr/bin/python
    from spambayes import hammie

    # dbinfo and dbtype hold the pgsql connection details (omitted here)
    h = hammie.open(dbinfo, dbtype, 'w')
    h.train_ham('here are some pictures from our trip to africa, i hope you enjoy them')
    h.store()
    h.train_spam('refinance your mortgage with cilias!')
    h.store()
    h.filter('do you want some viagra')

It seems to just be not catching the exception (you should be able to try to score text with little to no information present in the database about what is spam and what is ham -- it should just be unsure of it).

If I change line 320 of classifier.py (I'm using the latest 1.1a1 release now) to a very simple try/except clause:

    try:
        prob = spamratio / (hamratio + spamratio)
    except:
        prob = 0

you can't replicate the error with the above script.

Is this a patch that should be submitted? Is there a method for submitting this?

Thanks!
Todd

On 7/14/06, Kenny Pitt <kenny.pitt@gmail.com> wrote:
[...]
[Todd Kennedy]
With the definitions of spamcount and hamcount it makes sense that they might be zero, since there is minimal training data in the system, and the word being scored does not exist in the database.
This might be some sort of small bug with running the filter on a small amount of data, as I can reliably replicate a divide by zero error. If spamcount and hamcount are both zero, shouldn't the system return some sort of 0% probability for spam or ham (showing its uncertainty for the phrase being scored)?
Yes, and it does. That's what Kenny tried to tell you :-) This is Classifier._worddistanceget():

    def _worddistanceget(self, word):
        record = self._wordinfoget(word)
        if record is None:
            prob = options["Classifier", "unknown_word_prob"]
        else:
            prob = self.probability(record)
        distance = abs(prob - 0.5)
        return distance, prob, word, record

If there is no record for the word, then this returns the value of the "unknown_word_prob" option. It only tries to _compute_ the probability if there _is_ a record for the word, and it should never be the case that a record exists for a word with hamcount and spamcount both 0.

It would be helpful to dump print statements into that function (or run under Python's debugger) to see exactly which word it is and what's in that record -- or possibly you'd discover that _worddistanceget() isn't being called at all. You didn't include a complete traceback in your original message, so it's impossible from here to guess who called probability() to begin with. A complete traceback would help.
... If I change line 320 of classifier.py (I'm using the latest 1.1a1 release now) to a very simple try/except clause:

    try:
        prob = spamratio / (hamratio + spamratio)
    except:
        prob = 0
You can't replicate the error with the above script.
Is this a patch that should be submitted?
No, because that slows down a speed-critical function to paper over a problem that should never occur. The bug isn't that this is dividing by 0, the bug is that probability() is being _called_ when both counts are 0. Something, somewhere, on the path _toward_ calling probability() is in error.
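To make the point concrete, here is a minimal stand-alone sketch of the two paths, assuming a (spamcount, hamcount) tuple as the record. It is not the spambayes implementation: probability() below keeps only the ratio step from the line in the traceback, and UNKNOWN_WORD_PROB = 0.5 is an assumed stand-in for the "unknown_word_prob" option.

```python
# Simplified stand-in, not spambayes code.
UNKNOWN_WORD_PROB = 0.5  # assumption: placeholder for the configured option


def probability(spamcount, hamcount, nspam, nham):
    """Ratio step only, as on the line shown in the traceback."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    return spamratio / (hamratio + spamratio)


def word_distance(record, nspam, nham):
    """Mirror of the None check in _worddistanceget().

    record is either None (word not in the training data) or a
    (spamcount, hamcount) tuple.
    """
    if record is None:
        prob = UNKNOWN_WORD_PROB
    else:
        prob = probability(record[0], record[1], nspam, nham)
    return abs(prob - 0.5), prob


print(word_distance(None, 6, 6))    # (0.0, 0.5): unknown word, no division
print(word_distance((6, 0), 6, 6))  # (0.5, 1.0): seen only in spam

try:
    # A record that exists but has both counts at 0 skips the None check
    # and reaches the division with 0.0 / (0.0 + 0.0).
    word_distance((0, 0), 6, 6)
except ZeroDivisionError as exc:
    print("zero/zero record:", exc)
```

The division is only reachable when the lookup hands back a record, which is why the question is where a zero/zero record came from rather than how to catch the exception.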
Tim,

Thanks for the reply. I understand what you're talking about with papering over the problem.

I've included the full traceback that you get when you run the script I provided. Hopefully this will provide some information. Any ideas on how to resolve this would be great -- I'm moderately new to Python. Also, I upgraded to 1.1a2 and it's still occurring...

17:53:27 (~/src/spambayes) todd@mothra> ./test.py
Traceback (most recent call last):
  File "./test.py", line 9, in ?
    h.filter('do you want some viagra')
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/hammie.py", line 155, in filter
    debug, train)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/hammie.py", line 109, in score_and_filter
    prob, clues = self._scoremsg(msg, True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/hammie.py", line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 196, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 499, in _getclues
    tup = self._worddistanceget(word)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 514, in _worddistanceget
    prob = self.probability(record)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 320, in probability
    prob = spamratio / (hamratio + spamratio)
ZeroDivisionError: float division

On 7/14/06, Tim Peters <tim.peters@gmail.com> wrote:
[...]
I've included the full traceback that you get when you run the script I provided. Hopefully this will provide some information. Any ideas on how to resolve this would be great -- I'm moderately new to Python. Also, I upgraded to 1.1a2 and it's still occurring... [...]
I believe the problem is in _wordinfoget, which should return None if the word is not in the database (and this is how _worddistanceget decides whether to use the 'unknown token' probability).

PGClassifier's _wordinfoget method (actually the base SQLClassifier's), which, as Kenny said, isn't widely used, is:

    def _wordinfoget(self, word):
        if isinstance(word, unicode):
            word = word.encode("utf-8")

        row = self._get_row(word)
        if row:
            item = self.WordInfoClass()
            item.__setstate__((row["nspam"], row["nham"]))
            return item
        else:
            return self.WordInfoClass()

I believe this should be:

    def _wordinfoget(self, word):
        if isinstance(word, unicode):
            word = word.encode("utf-8")

        row = self._get_row(word)
        if row:
            item = self.WordInfoClass()
            item.__setstate__((row["nspam"], row["nham"]))
            return item
        else:
            return None

(This is more-or-less what the mySQL storage option does).

Try that change (just changing the final return from "self.WordInfoClass()" to "None"), and see if it fixes the problem. If it does, please let us know so that we can make the change in the repository as well.

=Tony.Meyer
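The difference between the two versions comes down to what the lookup returns for a missing word. The sketch below shows that contract in isolation, written for current Python, so the unicode handling is left out. ToySQLClassifier and BuggyToySQLClassifier are hypothetical stand-ins with a dict in place of the database; only the shape of _wordinfoget follows the code quoted above.

```python
# Hypothetical stand-ins; a dict replaces the SQL table.
class WordInfo:
    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0

    def __setstate__(self, state):
        self.spamcount, self.hamcount = state


class ToySQLClassifier:
    """Lookup that returns None for a word with no row."""

    WordInfoClass = WordInfo

    def __init__(self, rows):
        self.rows = rows  # word -> {"nspam": int, "nham": int}

    def _get_row(self, word):
        return self.rows.get(word)

    def _wordinfoget(self, word):
        row = self._get_row(word)
        if row:
            item = self.WordInfoClass()
            item.__setstate__((row["nspam"], row["nham"]))
            return item
        return None


class BuggyToySQLClassifier(ToySQLClassifier):
    """Lookup that returns an empty record instead of None."""

    def _wordinfoget(self, word):
        record = super()._wordinfoget(word)
        return record if record is not None else self.WordInfoClass()


rows = {"mortgage": {"nspam": 1, "nham": 0}}

print(ToySQLClassifier(rows)._wordinfoget("viagra"))  # None

ghost = BuggyToySQLClassifier(rows)._wordinfoget("viagra")
print(ghost.spamcount, ghost.hamcount)                # 0 0
```

With the empty record, a caller that tests "record is None" sees a real record whose counts are both 0, which is the state the earlier messages say should never occur.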
Tony,

That seems to have solved the issue. By changing the _wordinfoget function for the SQLClassifier class to return None on the else case I'm no longer getting a traceback, but rather:

    X-Spambayes-Classification: ham; 0.02

    do you want some viagra

as output of the h.filter() function.

Thanks much!!

On 7/16/06, Tony Meyer <tameyer@ihug.co.nz> wrote:
[...]
participants (4)
- Kenny Pitt
- Tim Peters
- Todd Kennedy
- Tony Meyer