RE: [spambayes-dev] Re: Pickle vs DB inconsistencies
Ah well, I almost had it right before...

OK, more investigation, prompted by trying to come up with an example for Tim. (Note, I wasn't saying that the database package was broken, just that the _wordinfo* functions in storage.py were.)

I can now get a list of the incorrect words by putting a print statement in two places: either all those words for which _wordinfodel() is called, or all those words for which the "del self.changed_words[word]" line does not raise an exception in _wordinfoset().

I guess the problem is not what I guessed before (to my credit, I said that I was unsure, and that I had narrowed it down, which was true ;), but along the lines of the delete issue that Tim pointed out. I was somewhat on the right track...

The problem (I am more sure, but still in the unsure range ;) is when tokens are deleted before they are written to the db. (A much nicer and easier to solve problem :)

Here's example code:

    from spambayes.storage import DBDictClassifier
    from spambayes.classifier import WordInfo

    d = DBDictClassifier("fail.db")
    print "Should not be an entry"
    print d._wordinfoget("tok")
    w = WordInfo()
    w.hamcount = 1
    d._wordinfoset("tok", w)
    print "Should have a ham count of 1, spam count of 0"
    print d._wordinfoget("tok")
    w.hamcount -= 1  # not really necessary
    d._wordinfodel("tok")
    #d.store()  # uncomment this line and it will work
    print "Should not be an entry (or have ham and spam of 0)"
    print d._wordinfoget("tok")

=Tony Meyer
[Tony Meyer]
... Here's example code:
    from spambayes.storage import DBDictClassifier
    from spambayes.classifier import WordInfo

    d = DBDictClassifier("fail.db")
    print "Should not be an entry"
    print d._wordinfoget("tok")
    w = WordInfo()
    w.hamcount = 1
    d._wordinfoset("tok", w)
    print "Should have a ham count of 1, spam count of 0"
    print d._wordinfoget("tok")
    w.hamcount -= 1  # not really necessary
    d._wordinfodel("tok")
    #d.store()  # uncomment this line and it will work
    print "Should not be an entry (or have ham and spam of 0)"
    print d._wordinfoget("tok")
OK, I checked in the change I mentioned before, and now this program prints

"""
Should not be an entry
None
Should have a ham count of 1, spam count of 0
WordInfo(0, 1)
Should not be an entry (or have ham and spam of 0)
None
"""

Note that it should not have a spam and ham count of 0 at the end; it should return None (as it does now). As the WordInfo class comment says,

    # Invariant:  For use in a classifier database, at least one of
    # spamcount and hamcount must be non-zero.

I also checked in other, more cosmetic changes. If it breaks something, let me know.
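The failure mode discussed in this thread (a token deleted before it has ever been written to the db) can be illustrated with a toy write-back cache. This is only a minimal sketch in Python 3, not the actual storage.py code: the class CachedStore, its method names, the state constants, and the plain dict standing in for the on-disk database are all assumptions made for illustration. The point it tries to show is that a delete has to drop the in-memory copy and leave a "deleted" marker, so that a later lookup returns None instead of a stale or zero-count record.

```python
# Hypothetical sketch only; the names below are NOT taken from spambayes.
CHANGED = "changed"
DELETED = "deleted"


class CachedStore:
    """Toy write-back cache over a dict that stands in for the database."""

    def __init__(self, db=None):
        self.db = {} if db is None else db  # stand-in for the on-disk db
        self.cache = {}                     # records held in memory
        self.changed = {}                   # word -> pending state

    def get(self, word):
        # A pending delete must hide whatever the db still holds.
        if self.changed.get(word) == DELETED:
            return None
        if word in self.cache:
            return self.cache[word]
        return self.db.get(word)

    def set(self, word, record):
        self.cache[word] = record
        self.changed[word] = CHANGED

    def delete(self, word):
        # Drop the in-memory copy too; a record that was set but never
        # stored would otherwise survive the delete.
        self.cache.pop(word, None)
        self.changed[word] = DELETED

    def store(self):
        # Flush pending changes to the stand-in database.
        for word, state in self.changed.items():
            if state == DELETED:
                self.db.pop(word, None)
            else:
                self.db[word] = self.cache[word]
        self.changed.clear()


if __name__ == "__main__":
    s = CachedStore()
    s.set("tok", (0, 1))   # (spamcount, hamcount)
    s.delete("tok")        # deleted before any store()
    print(s.get("tok"))
```

In this sketch the lookup after the delete yields None whether or not store() has been called in between, which matches the behaviour described above for the fixed version.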
On 26 June 2003, Tim Peters said:
OK, I checked in the change I mentioned before, and now this program prints
Hooray! I just cvs up'd, and the pickle/DB inconsistencies I observed have gone away. Thanks, Tony and Tim!

Greg
--
Greg Ward <gward@python.net> http://www.gerg.ca/
Never try to outstubborn a cat.
[Greg Ward]
Hooray! I just cvs up'd, and the pickle/DB inconsistencies I observed have gone away. Thanks, Tony and Tim!
Glad they're fixed, Greg! I was too busy to chat about it at the time, so fixed one obvious bug and went away again. How's the python.org spambayes experience going for you? I haven't noticed a spam spike since the switch, but there have been so many virus bounces the last few weeks I'm not sure I would have noticed even a large increase if there were one.
On 01 July 2003, Tim Peters said:
How's the python.org spambayes experience going for you?
AFAIK it's doing a pretty good job with spam, but as you and many others have noticed, spam is no longer the problem. Viruses with forged sender addresses are. I've just fed a whole bunch of "unsure" messages, many of which were bounces or autoreplies to viruses that forged a python.org address, into spambayes' "spam" corpus, and I'm doing a training run now (first in over a week). Will be interesting to see how well it works.

My biggest gripe with spambayes is the inconsistency of the command-line tools. They're scattered around the CVS tree randomly, there are as many different ways to specify the training database as there are separate scripts, and they all try to do too much. IMHO there should be one script for each of the following tasks:

  * training a bunch of messages
  * filtering a single message, i.e. read it, score it, write it back with "X-..." header(s) added
  * scoring one or more messages, i.e. read each one, score it, and print a single line with the results
  * export a database
  * import a database

They should all live in a 'scripts' directory (or something), and (naturally) they should use Optik/optparse for a consistent command-line interface.

Greg
--
Greg Ward <gward@python.net> http://www.gerg.ca/
What happens if you touch these two wires tog--
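One way the "consistent command-line interface" idea could look is a single helper that builds the options every script shares, with each script adding only what is specific to its task. The following is a rough sketch under assumptions, not anything that exists in the spambayes tree: the function names, the option names (-d/--database, --spam) and the default database file name are all hypothetical. It uses only the standard optparse module and is written for Python 3.

```python
from optparse import OptionParser


def make_parser(usage):
    """Build a parser carrying the options all scripts would share.

    Hypothetical helper; option names and defaults are illustrative only.
    """
    parser = OptionParser(usage=usage)
    parser.add_option(
        "-d", "--database", dest="database", default="spambayes.db",
        help="path to the training database (default: %default)")
    return parser


def parse_train_args(argv):
    """Parse arguments for a hypothetical training script."""
    parser = make_parser("usage: %prog [options] MAILBOX...")
    # Task-specific option, layered on top of the shared ones.
    parser.add_option(
        "--spam", action="store_true", dest="is_spam", default=False,
        help="train the messages as spam (default: ham)")
    options, args = parser.parse_args(argv)
    if not args:
        parser.error("at least one mailbox is required")
    return options, args


if __name__ == "__main__":
    import sys
    opts, mailboxes = parse_train_args(sys.argv[1:])
    print(opts.database, opts.is_spam, mailboxes)
```

With this shape, a scoring or export script would call make_parser() as well, so the database would be specified the same way everywhere.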
participants (3)
- Greg Ward
- Meyer, Tony
- Tim Peters