[spambayes-bugs] [ spambayes-Bugs-777026 ] Possible cause for db corruption in storage.py/DBDictClassif

SourceForge.net noreply at sourceforge.net
Thu Jul 24 22:39:34 EDT 2003


Bugs item #777026, was opened at 2003-07-24 12:17
Message generated for change (Comment added) made by tim_one
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=777026&group_id=61702

Category: None
Group: None
>Status: Open
>Resolution: Later
Priority: 5
Submitted By: Fionn Behrens (fionn)
>Assigned to: Tim Peters (tim_one)
Summary: Possible cause for db corruption in storage.py/DBDictClassif

Initial Comment:

DBDictClassifier uses a neat trick to save some memory:

    def _wordinfoset(self, word, record):
        if record and (record.spamcount + record.hamcount <= 1):
            self.db[word] = record.__getstate__()
            # Remove this word from the changed list (not that it should be
            # there, but strange things can happen :)
            try:
                del self.changed_words[word]
            except KeyError:
                pass

Unfortunately the programmer seems to have overlooked
that there might already be a self.wordinfo[word] entry
if record.spamcount+record.hamcount was previously > 1
and some message has since been untrained.
So, if a record is untrained from, e.g., a count of 2
to a count of 1, then wordinfo[word] will still hold 2
while the db[word] entry will be 1. This can lead to
minor miscounts in the spam/ham counts.

To circumvent the problem, the following should be
added to storage.py at line 239 (referring to version
1.0a3, right below the code part you see above):

            try:
                del self.wordinfo[word]
            except KeyError:
                pass
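
The scenario described above can be reproduced in a toy model. This is only a sketch, not the storage.py code: WordInfo, ToyClassifier, the drop_stale_cache_entry flag and untrain_with_copy are hypothetical names, plain dicts stand in for the real database, and the caller is assumed to pass a fresh record object instead of the one already held in wordinfo.

```python
class WordInfo:
    """Hypothetical stand-in for the per-word counts record."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def __getstate__(self):
        return (self.spamcount, self.hamcount)


class ToyClassifier:
    """Toy model: wordinfo is the in-memory cache, db the backing store."""

    def __init__(self, drop_stale_cache_entry):
        self.wordinfo = {}
        self.db = {}
        self.changed_words = {}
        # Assumption: a switch to compare behaviour with and without the fix.
        self.drop_stale_cache_entry = drop_stale_cache_entry

    def _wordinfoset(self, word, record):
        if record.spamcount + record.hamcount <= 1:
            # Singleton: written straight to the store, bypassing the cache.
            self.db[word] = record.__getstate__()
            self.changed_words.pop(word, None)
            if self.drop_stale_cache_entry:
                # The deletion proposed in this report.
                self.wordinfo.pop(word, None)
        else:
            self.wordinfo[word] = record
            self.changed_words[word] = True


def untrain_with_copy(classifier):
    """Go from a count of 2 to a count of 1 using a *different* object."""
    classifier._wordinfoset("cheap", WordInfo(spamcount=2))
    classifier._wordinfoset("cheap", WordInfo(spamcount=1))
    cached = classifier.wordinfo.get("cheap")
    cached_count = cached.spamcount if cached is not None else None
    return (cached_count, classifier.db["cheap"][0])


stale = untrain_with_copy(ToyClassifier(drop_stale_cache_entry=False))
fixed = untrain_with_copy(ToyClassifier(drop_stale_cache_entry=True))
```

In this model, stale pairs a cached count of 2 with a stored count of 1, while fixed has no cached entry left to disagree with the store. Whether real callers ever pass a distinct object is a separate question that the toy does not answer.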

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2003-07-25 00:39

Message:
Logged In: YES 
user_id=31435

Well, the reason this works is really quite subtle:  the 
classifier's _remove_msg and _add_msg methods *mutate* 
the WordInfo record in the wordinfo dict, and pass the 
mutated version on to _wordinfoset.  That's why they never 
get out of synch.

In fact, you can add this line to the start of _wordinfoset's 
outermost else clause:

            assert self.wordinfo[word] is record

and it won't fail.  The assignment

            self.wordinfo[word] = record

isn't actually needed in _wordinfoset!  It always rebinds 
self.wordinfo[word] to the object it was already bound to.

But this isn't apparent from the guts of _wordinfoset, it's a 
property that follows from what holds at the only two places 
_wordinfoset is called from the classifier, and that 
_wordinfoget fills in wordinfo[word] too.

This is really too delicate to bear.  Reopening and assigning to 
me.  While calls of _wordinfoset from the classifier happen 
always to pass the same record on to _wordinfoset as they 
got from the wordinfo dict, I'm not sure all calls everywhere 
do this.  Better to make it bulletproof than to rely on this.
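
A toy model may make the aliasing argument easier to follow. It is a sketch under assumptions, not the real storage.py: ToyStore, add_spam and remove_spam are hypothetical, per-word stand-ins for the classifier's _add_msg and _remove_msg, and plain dicts replace the database. The point is only that the caller mutates the very object held in wordinfo, so the assert in the else clause holds and the cache cannot drift from what gets written.

```python
class WordInfo:
    """Hypothetical stand-in for the per-word counts record."""

    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0


class ToyStore:
    def __init__(self):
        self.wordinfo = {}  # in-memory cache
        self.db = {}        # stand-in for the backing database

    def _wordinfoget(self, word):
        try:
            return self.wordinfo[word]
        except KeyError:
            state = self.db.get(word)
            if state is None:
                return None
            record = WordInfo()
            record.spamcount, record.hamcount = state
            # The lookup itself fills in the cache entry.
            self.wordinfo[word] = record
            return record

    def _wordinfoset(self, word, record):
        if record.spamcount + record.hamcount <= 1:
            self.db[word] = (record.spamcount, record.hamcount)
        else:
            # Holds because callers pass back the object the cache gave them.
            assert self.wordinfo[word] is record
            self.wordinfo[word] = record  # rebinds to the same object

    def add_spam(self, word):
        record = self._wordinfoget(word)
        if record is None:
            record = WordInfo()
        record.spamcount += 1  # mutates the cached object in place
        self._wordinfoset(word, record)

    def remove_spam(self, word):
        record = self._wordinfoget(word)
        record.spamcount -= 1  # again an in-place mutation
        self._wordinfoset(word, record)


store = ToyStore()
store.add_spam("cheap")     # count 1: written to db only
store.add_spam("cheap")     # count 2: loaded into the cache, then mutated
store.remove_spam("cheap")  # untrained back to a count of 1
cached = store.wordinfo["cheap"].spamcount
stored = store.db["cheap"][0]
```

After the untrain step both cached and stored are 1. The guarantee lives entirely in how add_spam and remove_spam behave, which is exactly the delicacy complained about above: a caller that built a new record instead of mutating the cached one would trip the assert or leave a stale cache entry.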

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-07-24 23:25

Message:
Logged In: YES 
user_id=14198

The code currently works, as in the case you describe
self.wordinfo[key] is still correctly set to 1.  Thus, the
_wordinfoget() gets the correct value.

1.3 is quite out of date - other bugs have been fixed since
then. However, I added a test\test_storage.py file that
tries to exercise these edge cases - if you believe there is
still a bug, please provoke that into failing.

----------------------------------------------------------------------
