[Spambayes] Better optimization loop
Neale Pickett
neale@woozle.org
Wed Nov 20 21:16:25 2002
So then, "Rob W.W. Hooft" <rob@hooft.net> is all like:
> Another speedup I could use is a version of Bayes that calculates the
> spamprob from the numbers on demand instead of calculating them for
> all words every time. This pays off for all cases where the training
> batch is very small (~1 message).
Funny you should bring that up, Rob, because I happen to be working on
exactly that. The only way I could think to do it was to pass in a new
option to Bayes.learn() and Bayes.unlearn().
I've therefore removed the update_probabilities option and replaced it
with update_word_probabilities. My thinking here is that asking callers
to run Bayes.update_probabilities() when they need it isn't too much of
a burden (most of them call it explicitly anyway), but learn() and
unlearn() are the *only* places that individual word rescoring can
happen.
The changed methods become:
    def learn(self, wordstream, is_spam, update_word_probabilities=True):
        self._add_msg(wordstream, is_spam, update_word_probabilities)

    def unlearn(self, wordstream, is_spam, update_word_probabilities=True):
        self._remove_msg(wordstream, is_spam, update_word_probabilities)

    def _add_msg(self, wordstream, is_spam, update_word_probabilities):
        ...

    def _remove_msg(self, wordstream, is_spam, update_word_probabilities):
        ...
And inside the for loop in _add_msg() and _remove_msg() is this:
            if update_word_probabilities:
                self.update_word_probability(word, record)
            else:
                # Needed to tell a persistent DB that the content
                # changed.
                wordinfo[word] = record
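
To make the effect of the flag concrete, here is a minimal, self-contained sketch. It is not the Spambayes classifier: ToyBayes, WordRecord, the count bookkeeping, and the simplified ratio formula are all stand-ins assumed for illustration. Only the learn()/unlearn() signatures and the if/else branch follow the snippets above.

```python
class WordRecord:
    """Per-word counts plus a cached probability (hypothetical stand-in)."""

    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0
        self.spamprob = None  # None means "not computed yet"


class ToyBayes:
    """Illustrative only; not the real Spambayes Bayes class."""

    def __init__(self):
        self.wordinfo = {}
        self.nspam = 0
        self.nham = 0

    def learn(self, wordstream, is_spam, update_word_probabilities=True):
        self._add_msg(wordstream, is_spam, update_word_probabilities)

    def unlearn(self, wordstream, is_spam, update_word_probabilities=True):
        self._remove_msg(wordstream, is_spam, update_word_probabilities)

    def update_word_probability(self, word, record):
        # Simplified ratio formula, assumed for this sketch only.
        spamratio = record.spamcount / max(self.nspam, 1)
        hamratio = record.hamcount / max(self.nham, 1)
        total = spamratio + hamratio
        record.spamprob = spamratio / total if total else 0.5
        self.wordinfo[word] = record

    def update_probabilities(self):
        # Full pass over every known word; the cost the flag lets you defer.
        for word, record in list(self.wordinfo.items()):
            self.update_word_probability(word, record)

    def _add_msg(self, wordstream, is_spam, update_word_probabilities):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(wordstream):
            record = self.wordinfo.get(word) or WordRecord()
            if is_spam:
                record.spamcount += 1
            else:
                record.hamcount += 1
            if update_word_probabilities:
                self.update_word_probability(word, record)
            else:
                # Store the record so a persistent mapping sees the change.
                self.wordinfo[word] = record

    def _remove_msg(self, wordstream, is_spam, update_word_probabilities):
        if is_spam:
            self.nspam = max(self.nspam - 1, 0)
        else:
            self.nham = max(self.nham - 1, 0)
        for word in set(wordstream):
            record = self.wordinfo.get(word)
            if record is None:
                continue
            if is_spam:
                record.spamcount = max(record.spamcount - 1, 0)
            else:
                record.hamcount = max(record.hamcount - 1, 0)
            if record.spamcount == 0 and record.hamcount == 0:
                del self.wordinfo[word]
            elif update_word_probabilities:
                self.update_word_probability(word, record)
            else:
                self.wordinfo[word] = record


# Large batch: skip per-word rescoring, then do one full pass at the end.
batch = ToyBayes()
for words, is_spam in [(["cheap", "pills"], True), (["meeting", "notes"], False)]:
    batch.learn(words, is_spam, update_word_probabilities=False)
batch.update_probabilities()

# Single message: rescore only the words that message touched.
single = ToyBayes()
single.learn(["cheap", "pills"], True)
```

In this toy version, words that do not appear in the message keep whatever spamprob they had cached until update_probabilities() runs, so per-word rescoring suits small batches and the full pass suits large ones.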
I'll check all this in to the hammie-playground branch as soon as I can
be sure it doesn't break anything. If we all think it's kosher, I'll
merge it into HEAD.
Neale