[Spambayes] Better optimization loop

Neale Pickett neale@woozle.org
Wed Nov 20 21:16:25 2002


So then, "Rob W.W. Hooft" <rob@hooft.net> is all like:

> Another speedup I could use is a version of Bayes that calculates the
> spamprob from the numbers on demand instead of calculating them for
> all words every time. This pays off for all cases where the training
> batch is very small (~1 message).

Funny you should bring that up, Rob, because I happen to be working on
exactly that.  The only way I could think to do it was to pass in a new
option to Bayes.learn() and Bayes.unlearn().

I've therefore removed the update_probabilities option and replaced it
with update_word_probabilities.  My thinking here is that asking callers
to run Bayes.update_probabilities() when they need it isn't too much of
a burden (most of them call it explicitly anyway), but learn() and
unlearn() are the *only* places that individual word rescoring can
happen.

The changed methods become:

    def learn(self, wordstream, is_spam, update_word_probabilities=True):
        self._add_msg(wordstream, is_spam, update_word_probabilities)

    def unlearn(self, wordstream, is_spam, update_word_probabilities=True):
        self._remove_msg(wordstream, is_spam, update_word_probabilities)

    def _add_msg(self, wordstream, is_spam, update_word_probabilities):
        ...

    def _remove_msg(self, wordstream, is_spam, update_word_probabilities):
        ...

And inside the for loop in _add_msg() and _remove_msg() is this:

            if update_word_probabilities:
                self.update_word_probability(word, record)
            else:
                # Needed to tell a persistent DB that the content
                # changed.
                wordinfo[word] = record
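To make the intent concrete, here's a minimal toy sketch of the pattern
(this is *not* the real spambayes classifier: ToyBayes, WordInfo, and the
clamped spam/total ratio in update_word_probability are simplified
stand-ins I made up for illustration; only the learn()/_add_msg()
signatures and the if/else above mirror the actual patch):

```python
class WordInfo:
    """Per-word counters plus a cached probability."""
    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0
        self.spamprob = 0.5

class ToyBayes:
    def __init__(self):
        self.wordinfo = {}
        self.nspam = 0
        self.nham = 0

    def update_word_probability(self, word, record):
        # Simplified per-word rescore: spam fraction of all sightings,
        # clamped away from 0 and 1.  A stand-in for the real formula.
        total = record.spamcount + record.hamcount
        if total:
            record.spamprob = min(0.99, max(0.01, record.spamcount / total))
        # Reassign so a persistent mapping (e.g. a shelve) sees the change.
        self.wordinfo[word] = record

    def update_probabilities(self):
        # Full pass: rescore every word in the database.
        for word, record in self.wordinfo.items():
            self.update_word_probability(word, record)

    def learn(self, wordstream, is_spam, update_word_probabilities=True):
        self._add_msg(wordstream, is_spam, update_word_probabilities)

    def _add_msg(self, wordstream, is_spam, update_word_probabilities):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in wordstream:
            record = self.wordinfo.setdefault(word, WordInfo())
            if is_spam:
                record.spamcount += 1
            else:
                record.hamcount += 1
            if update_word_probabilities:
                self.update_word_probability(word, record)
            else:
                # Needed to tell a persistent DB that the content
                # changed.
                self.wordinfo[word] = record

# Incremental training (~1 message): rescore only the touched words.
b = ToyBayes()
b.learn(["cheap", "pills"], is_spam=True)

# Batch training: defer rescoring, then do one full pass at the end.
b.learn(["meeting", "agenda"], is_spam=False,
        update_word_probabilities=False)
b.update_probabilities()
```

The point of the flag is the last few lines: a caller adding one message
rescored only two words, while a batch trainer skips per-word rescoring
entirely and pays for a single update_probabilities() sweep afterward.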

I'll check all this in to the hammie-playground branch as soon as I can
be sure it doesn't break anything.  If we all think it's kosher, I'll
merge it into HEAD.

Neale
