[spambayes-bugs] [ spambayes-Bugs-777026 ] Possible cause for db
corruption in storage.py/DBDictClassif
SourceForge.net
noreply at sourceforge.net
Thu Jul 24 22:39:34 EDT 2003
Bugs item #777026, was opened at 2003-07-24 12:17
Message generated for change (Comment added) made by tim_one
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=777026&group_id=61702
Category: None
Group: None
>Status: Open
>Resolution: Later
Priority: 5
Submitted By: Fionn Behrens (fionn)
>Assigned to: Tim Peters (tim_one)
Summary: Possible cause for db corruption in storage.py/DBDictClassif
Initial Comment:
DBDictClassifier uses a neat trick to save some memory:
    def _wordinfoset(self, word, record):
        if record and (record.spamcount + record.hamcount <= 1):
            self.db[word] = record.__getstate__()
            # Remove this word from the changed list (not that it
            # should be there, but strange things can happen :)
            try:
                del self.changed_words[word]
            except KeyError:
                pass
Unfortunately the programmer seems to have overlooked
that there might already be a self.wordinfo[word] entry
if record.spamcount + record.hamcount was previously > 1
and a message has since been untrained.
So if a record is untrained from, e.g., a count of 2
to a count of 1, wordinfo[word] will still hold 2
while the db[word] entry will be 1. This can lead to
minor miscounts in the spam/ham counts.
To circumvent the problem, the following should be
added to storage.py at line 239 (referring to version
1.0a3, right below the code part you see above):
            try:
                del self.wordinfo[word]
            except KeyError:
                pass
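The scenario above can be sketched in a few lines. This is only a toy model, not the SpamBayes code: ToyRecord, ToyClassifier, the evict_singletons flag and counts_after_untrain are hypothetical names, and plain dicts stand in for both the in-memory cache and the database. It assumes _wordinfoset is handed a record object different from the one already cached, which is the case where the cache can go stale, and shows the proposed eviction preventing it.

```python
class ToyRecord:
    """Hypothetical stand-in for a WordInfo record."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def __getstate__(self):
        return (self.spamcount, self.hamcount)


class ToyClassifier:
    """Hypothetical model: dicts stand in for the cache and the database."""

    def __init__(self, evict_singletons):
        self.db = {}             # word -> (spamcount, hamcount)
        self.wordinfo = {}       # word -> ToyRecord (in-memory cache)
        self.changed_words = {}
        self.evict_singletons = evict_singletons

    def _wordinfoset(self, word, record):
        if record.spamcount + record.hamcount <= 1:
            # Singleton: write straight to the database.
            self.db[word] = record.__getstate__()
            self.changed_words.pop(word, None)
            if self.evict_singletons:
                # The proposed fix: drop any cached copy as well.
                self.wordinfo.pop(word, None)
        else:
            self.wordinfo[word] = record
            self.changed_words[word] = True


def counts_after_untrain(evict_singletons):
    """Return (cached spamcount or None, spamcount stored in db)."""
    c = ToyClassifier(evict_singletons)
    c._wordinfoset("cheap", ToyRecord(spamcount=2))  # cached with count 2
    c._wordinfoset("cheap", ToyRecord(spamcount=1))  # a different object
    cached = c.wordinfo.get("cheap")
    cached_count = cached.spamcount if cached is not None else None
    return cached_count, c.db["cheap"][0]
```

Without the eviction the toy cache keeps the old count of 2 while the database holds 1; with it, the cached entry is gone and the next lookup would have to come from the database.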
----------------------------------------------------------------------
>Comment By: Tim Peters (tim_one)
Date: 2003-07-25 00:39
Message:
Logged In: YES
user_id=31435
Well, the reason this works is really quite subtle: the
classifier's _remove_msg and _add_msg methods *mutate*
the WordInfo record in the wordinfo dict, and pass the
mutated version on to _wordinfoset. That's why they never
get out of synch.
In fact, you can add this line to the start of _wordinfoset's
outermost else clause:
    assert self.wordinfo[word] is record
and it won't fail. The assignment
    self.wordinfo[word] = record
isn't actually needed in _wordinfoset! It always rebinds
self.wordinfo[word] to the object it was already bound to.
But this isn't apparent from the guts of _wordinfoset, it's a
property that follows from what holds at the only two places
_wordinfoset is called from the classifier, and that
_wordinfoget fills in wordinfo[word] too.
This is really too delicate to bear. Reopening and assigning to
me. While calls of _wordinfoset from the classifier happen
always to pass the same record on to _wordinfoset as they
got from the wordinfo dict, I'm not sure all calls everywhere
do this. Better to make it bulletproof than to rely on this.
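A rough sketch of why the cache and the database stay in step as long as callers mutate the cached object in place. It is a toy model under stated assumptions, not the real classifier: Record, ToyStore, add_spam, remove_spam and demo are hypothetical names, and a dict stands in for the database. The lookup fills the cache, the caller mutates that same object, and _wordinfoset receives it back, so the identity assertion holds.

```python
class Record:
    """Hypothetical stand-in for WordInfo."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount


class ToyStore:
    """Hypothetical model of the cache/database pair, without any eviction."""

    def __init__(self):
        self.db = {}        # word -> (spamcount, hamcount)
        self.wordinfo = {}  # word -> Record (in-memory cache)

    def _wordinfoget(self, word):
        record = self.wordinfo.get(word)
        if record is None and word in self.db:
            record = Record(*self.db[word])
            self.wordinfo[word] = record  # the lookup fills the cache too
        return record

    def _wordinfoset(self, word, record):
        if record.spamcount + record.hamcount <= 1:
            self.db[word] = (record.spamcount, record.hamcount)
        else:
            # Holds only because callers pass back the cached object.
            assert self.wordinfo[word] is record
            self.wordinfo[word] = record  # rebinds to the same object

    def add_spam(self, word):
        record = self._wordinfoget(word)
        if record is None:
            record = Record()
        record.spamcount += 1  # mutate in place
        self._wordinfoset(word, record)

    def remove_spam(self, word):
        record = self._wordinfoget(word)
        record.spamcount -= 1  # mutate in place
        self._wordinfoset(word, record)


def demo():
    store = ToyStore()
    store.add_spam("cheap")     # count 1: written to db only
    store.add_spam("cheap")     # count 2: loaded into the cache, then mutated
    store.remove_spam("cheap")  # count 1 again: cached object mutated, db updated
    return store.wordinfo["cheap"].spamcount, store.db["cheap"]
```

In this toy, demo() leaves the cached record and the database entry agreeing on a count of 1, purely because the cached object itself was mutated. A caller that built a fresh Record instead would trip the assertion, or silently leave a stale cache entry on the singleton path.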
----------------------------------------------------------------------
Comment By: Mark Hammond (mhammond)
Date: 2003-07-24 23:25
Message:
Logged In: YES
user_id=14198
The code currently works: in the case you describe,
self.wordinfo[key] is still correctly set to 1, so
_wordinfoget() returns the correct value.
1.3 is quite out of date - other bugs have been fixed since
then. However, I added a test\test_storage.py file that
tries to exercise these edge cases - if you believe there is
still a bug, please provoke that into failing.
----------------------------------------------------------------------