[Spambayes] Better optimization loop

Rob Hooft rob@hooft.net
Wed Nov 20 21:28:51 2002


Neale Pickett wrote:
> So then, "Rob W.W. Hooft" <rob@hooft.net> is all like:
> 
> 
>>Another speedup I could use is a version of Bayes that calculates the
>>spamprob from the numbers on demand instead of calculating them for
>>all words every time. This pays off in all cases where the training
>>batch is very small (~1 message).

> And inside the for loop in _add_msg() and _remove_msg() is this:
> 
>             if update_word_probabilities:
>                 self.update_word_probability(word, record)
>             else:
>                 # Needed to tell a persistent DB that the content
>                 # changed.
>                 wordinfo[word] = record

I was thinking along different lines: when the training batch and the 
scoring batch are both approximately 1 message, we can forget about 
stored word probabilities altogether. Just don't store them anywhere 
anymore, and calculate the individual word probabilities from the raw 
counts while scoring. This not only saves time, because many of the 
words that enter the database will "never" be used again (hapaxes...), 
so computing their probabilities up front is wasted work, but it should 
also shrink the database, since no probability needs to be stored per 
word. If on-demand calculation turns out to be too slow, we can speed 
it up with a cache: a dictionary mapping raw count tuples to 
probabilities.
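To make this concrete, here is a rough sketch. It is untested, the 
class and attribute names are made up for illustration rather than 
taken from the real classifier, and the combining step assumes a 
Robinson-style adjustment with an unknown-word strength S and prior x:

    UNKNOWN_WORD_STRENGTH = 0.45    # "S": weight given to the prior
    UNKNOWN_WORD_PROB = 0.5         # "x": prior for never-seen words

    class OnDemandBayes:
        """Illustrative classifier that stores raw counts only."""

        def __init__(self):
            self.wordinfo = {}   # word -> (spamcount, hamcount)
            self.nspam = 0       # number of spam messages trained on
            self.nham = 0        # number of ham messages trained on
            self._cache = {}     # (spamcount, hamcount) -> probability

        def train(self, words, is_spam):
            # Store raw counts only; no per-word probability is ever
            # written, so training one message touches nothing else.
            if is_spam:
                self.nspam += 1
            else:
                self.nham += 1
            self._cache.clear()  # totals changed; cached probs stale
            for word in set(words):
                spamcount, hamcount = self.wordinfo.get(word, (0, 0))
                if is_spam:
                    spamcount += 1
                else:
                    hamcount += 1
                self.wordinfo[word] = (spamcount, hamcount)

        def spamprob(self, word):
            # Compute the word probability from raw counts at scoring
            # time, memoizing on the raw count tuple.
            counts = self.wordinfo.get(word, (0, 0))
            if counts in self._cache:
                return self._cache[counts]
            spamcount, hamcount = counts
            n = spamcount + hamcount
            if n == 0:
                prob = UNKNOWN_WORD_PROB
            else:
                hamratio = hamcount / max(self.nham, 1)
                spamratio = spamcount / max(self.nspam, 1)
                p = spamratio / (hamratio + spamratio)
                # Shrink toward the unknown-word prior, Robinson-style.
                prob = ((UNKNOWN_WORD_STRENGTH * UNKNOWN_WORD_PROB
                         + n * p) / (UNKNOWN_WORD_STRENGTH + n))
            self._cache[counts] = prob
            return prob

The cache is keyed on the raw count tuple alone, so it has to be 
invalidated whenever nspam or nham changes; with ~1-message training 
batches that happens often, but recomputation is cheap, and only the 
words that are actually scored ever get a probability at all.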

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/



