[Spambayes] Speeding DBDictClassifier

Tim Peters tim_one at email.msn.com
Mon May 26 03:29:55 EDT 2003


I haven't looked at this code before.  It appears that
DBDictClassifier.store() writes to the DB for every word in the wordinfo
dict, whether or not the info associated with the word has changed; and
doesn't clear the wordinfo dict at the end, so that the next time .store()
is called it will write every word all over again, and store() becomes more
expensive every time it's called.

Maybe that's essential <wink>.

Sketch of a different approach; the thrust is to change store() so that it
only touches the database records that actually changed since the last time
store() was called.

load():  set new instance vars
    self.changed_words = {}
    self.deleted_words = {}

store():
    don't mutate wordinfo at all; don't iterate over wordinfo at all
    delete from the DB:  only the words in self.deleted_words
    update in the DB: only the words in self.changed_words
    clear changed_words and deleted_words before returning
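
A rough sketch of what such a store() could look like, assuming the DB
behaves like a dict mapping words to records (a plain dict stands in for it
here).  The function name and signature are made up for illustration; this
is not the real DBDictClassifier code.

```python
def store_changes(db, wordinfo, changed_words, deleted_words):
    """Write only the records touched since the last store.

    Assumption: db is dict-like (a plain dict stands in for the real
    database).  The name and signature are hypothetical.
    """
    # Delete from the DB only the words removed since the last store.
    for word in deleted_words:
        if word in db:          # the word may never have been written
            del db[word]
    # Update in the DB only the words set since the last store.
    for word in changed_words:
        db[word] = wordinfo[word]
    touched = len(changed_words) + len(deleted_words)
    # Clear the bookkeeping so the next call starts with nothing pending.
    changed_words.clear()
    deleted_words.clear()
    return touched
```

The cost of a call then depends on how many words changed since the previous
call, not on the size of wordinfo.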

_wordinfoget():
    remove the comment about None (it's no longer special)
    after "ret = None", do

        if word in self.deleted_words:
            return ret
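
The check matters because a deleted word stays in the DB until the next
store(), so a lookup that falls through to the DB would bring it back.  A
rough sketch, assuming wordinfo acts as an in-memory cache in front of a
dict-like DB; the standalone function below is illustrative only, not the
real method body.

```python
def wordinfo_get(word, wordinfo, db, deleted_words):
    """Look up a word, honoring deletions that haven't been stored yet.

    Assumptions: wordinfo is an in-memory cache, db is dict-like, and
    this free function stands in for the real _wordinfoget() method.
    """
    ret = None
    # A pending delete wins, even though the DB still holds the record.
    if word in deleted_words:
        return ret
    ret = wordinfo.get(word)
    if ret is None:
        # Not cached: fall through to the DB and cache what we find.
        ret = db.get(word)
        if ret is not None:
            wordinfo[word] = ret
    return ret
```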

_wordinfoset():
    define this

    def _wordinfoset(self, word, record):
        self.wordinfo[word] = record
        if word in self.deleted_words:
            del self.deleted_words[word]
        self.changed_words[word] = 1


_wordinfodel():
    change to:

    def _wordinfodel(self, word):
        del self.wordinfo[word]
        if word in self.changed_words:
            del self.changed_words[word]
        self.deleted_words[word] = 1
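
Taken together, the two methods keep a word out of both dicts at once:
whichever operation happened last decides what store() does with it.  A small
self-contained sketch of just that bookkeeping (the class name is made up,
and there is no DB behind it):

```python
class ChangeTracker:
    """Hypothetical stand-in holding only the change bookkeeping."""

    def __init__(self):
        self.wordinfo = {}
        self.changed_words = {}   # dicts used as sets, as in the sketch
        self.deleted_words = {}

    def _wordinfoset(self, word, record):
        self.wordinfo[word] = record
        # Setting a word cancels any pending delete for it.
        if word in self.deleted_words:
            del self.deleted_words[word]
        self.changed_words[word] = 1

    def _wordinfodel(self, word):
        del self.wordinfo[word]
        # Deleting a word cancels any pending update for it.
        if word in self.changed_words:
            del self.changed_words[word]
        self.deleted_words[word] = 1


tracker = ChangeTracker()
tracker._wordinfoset("spam", 1)
tracker._wordinfodel("spam")      # set then deleted: only the delete is pending
tracker._wordinfoset("ham", 2)
```

After these calls, "ham" is the only pending update and "spam" the only
pending delete.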



