[Spambayes] Speeding DBDictClassifier
Tim Peters
tim_one at email.msn.com
Mon May 26 03:29:55 EDT 2003
I haven't looked at this code before. It appears that
DBDictClassifier.store() writes to the DB for every word in the wordinfo
dict, whether or not the info associated with the word has changed; and
doesn't clear the wordinfo dict at the end, so that the next time .store()
is called it will write every word all over again, and store() becomes more
expensive every time it's called.
Maybe that's essential <wink>.
Sketch of a different approach; the thrust is to change store() so that it
only touches the database records that actually changed since the last time
store() was called.
load():  set new instance vars
    self.changed_words = {}
    self.deleted_words = {}
store():
    don't mutate wordinfo at all; don't iterate over wordinfo at all
    delete from the DB:  only the words in self.deleted_words
    update in the DB:  only the words in self.changed_words
    clear changed_words and deleted_words before returning
_wordinfoget():
    remove the comment about None (it's no longer special)
    after "ret = None", do
        if word in self.deleted_words:
            return ret
_wordinfoset():
    define this
        def _wordinfoset(self, word, record):
            self.wordinfo[word] = record
            if word in self.deleted_words:
                del self.deleted_words[word]
            self.changed_words[word] = 1
_wordinfodel():
    change to:
        def _wordinfodel(self, word):
            del self.wordinfo[word]
            if word in self.changed_words:
                del self.changed_words[word]
            self.deleted_words[word] = 1
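Pulled together, the outline above might look something like the following. This is only a rough sketch, not the real DBDictClassifier: the class name DirtyTrackingStore is made up, a plain dict stands in for the database, sets replace the dicts-used-as-sets, store() returns a count of DB operations purely so the effect is visible, and _wordinfodel() uses pop() so that it tolerates a word that was never cached.

```python
class DirtyTrackingStore:
    """Hypothetical stand-in for the wordinfo handling sketched above."""

    def __init__(self, db):
        self.db = db  # assumption: any mapping-like backing store
        self.load()

    def load(self):
        self.wordinfo = {}          # in-memory cache of records
        self.changed_words = set()  # words to (re)write on next store()
        self.deleted_words = set()  # words to remove on next store()

    def store(self):
        """Touch only the records that changed since the last store()."""
        ops = 0
        for word in self.deleted_words:
            self.db.pop(word, None)  # the word may never have reached the DB
            ops += 1
        for word in self.changed_words:
            self.db[word] = self.wordinfo[word]
            ops += 1
        self.changed_words.clear()
        self.deleted_words.clear()
        return ops

    def _wordinfoget(self, word):
        if word in self.deleted_words:
            return None  # deleted in memory, even if still in the DB
        if word in self.wordinfo:
            return self.wordinfo[word]
        ret = self.db.get(word)
        if ret is not None:
            self.wordinfo[word] = ret  # cache it; not a change, so not dirty
        return ret

    def _wordinfoset(self, word, record):
        self.wordinfo[word] = record
        self.deleted_words.discard(word)
        self.changed_words.add(word)

    def _wordinfodel(self, word):
        self.wordinfo.pop(word, None)  # tolerate a word that was never cached
        self.changed_words.discard(word)
        self.deleted_words.add(word)


db = {}
c = DirtyTrackingStore(db)
c._wordinfoset("viagra", (0, 5))
c._wordinfoset("python", (7, 0))
first = c.store()    # both words are new, so both get written

c._wordinfoset("python", (8, 0))
c._wordinfodel("viagra")
second = c.store()   # one update plus one delete

third = c.store()    # nothing changed, so nothing is touched
```

The point of the last call is that a store() with no intervening changes does no DB work at all, rather than rewriting every word seen so far.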