[Spambayes] Better optimization loop

T. Alexander Popiel popiel@wolfskeep.com
Thu Nov 21 01:31:50 2002


In message:  <w53lm3nzsnb.fsf@woozle.org>
             Neale Pickett <neale@woozle.org> writes:
>
>What I've been doing in my idle time for the past few hours is playing
>around with having the WordInfo class compute its own probability.

[snip]

>My idea was that you'd have to score the probability for each word
>whenever you use it first, but after that the probability is cached.
>Long-running things like the pop proxy will get the benefit of the
>cached probabilities, and short-lived things like hammiefilter get much
>faster training, and only slightly slower scoring.  At least, that's
>what I expect.  I haven't tested this yet.

What this seems to lack is a good (cheap) way to invalidate the
cache.  Since changing the amount of training data affects the
Bayesian adjustment to the probability for just about every word
in the database, being able to invalidate the cache is important.
(Yes, I know I keep harping on this, but a lot of the ideas
circulating on this topic seem to ignore it.)
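
To make this concrete, here's a rough sketch of how Neale's lazy
per-word caching and a cheap invalidation scheme could fit together:
the classifier keeps a single training-revision counter, each WordInfo
stamps its cached probability with the revision it was computed at,
and bumping the counter on every train invalidates every cache entry
at once with no per-word work.  This is just an illustration, not the
real classifier code; the names (TrainingState, spamprob) and the
probability formula are placeholders, not the Robinson math we
actually use.

    class TrainingState:
        """Global training counters; bumping `revision` invalidates all caches."""
        def __init__(self):
            self.nspam = 0
            self.nham = 0
            self.revision = 0          # bumped on every train/untrain

        def train(self, is_spam):
            if is_spam:
                self.nspam += 1
            else:
                self.nham += 1
            self.revision += 1         # O(1) invalidation for the whole database

    class WordInfo:
        """Per-word counts plus a lazily computed, revision-stamped probability."""
        def __init__(self):
            self.spamcount = 0
            self.hamcount = 0
            self._cached_prob = None
            self._cached_rev = -1      # revision the cache was filled at

        def spamprob(self, state):
            # Recompute only if training has changed since the cache was filled.
            if self._cached_rev != state.revision:
                spamratio = float(self.spamcount) / max(state.nspam, 1)
                hamratio = float(self.hamcount) / max(state.nham, 1)
                denom = spamratio + hamratio
                if denom:
                    self._cached_prob = spamratio / denom
                else:
                    self._cached_prob = 0.5    # neutral "unknown word" value
                self._cached_rev = state.revision
            return self._cached_prob

The point of the revision stamp is exactly the concern above: changing
nspam or nham shifts the ratios for every word, so any per-word cache
has to be treated as stale after any training at all.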

FWIW, I did a small time test on the patch I posted... and it seems
to run marginally faster than the original code in the classic timcv
setting.  I think that getting rid of tracking the timestamps (and
making the change non-optional, unlike the first buggy version I
mentioned about a week ago) offset the added work of checking mutiple
places on a cache miss.

Of course, the cached approach will be much faster than dealing with
update_probabilities in the fine-grained train-a-few, classify-a-few,
train-a-few-again setting... but I haven't actually tested that yet.
I need to do that.
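
For reference, the two training styles look roughly like this;
learn(), spamprob() and update_probabilities() are the classifier
methods as I understand them, but treat the details as illustrative
rather than as a patch.

    def batch_train(classifier, training_messages, tokenize):
        # Batch style (what timcv does): train everything, then recompute
        # every probability in one pass over the whole database.
        for msg, is_spam in training_messages:
            classifier.learn(tokenize(msg), is_spam)
        classifier.update_probabilities()      # touches every word once

    def interleaved_train(classifier, batches, tokenize):
        # Fine-grained style: train a few, classify a few, train a few more.
        # Calling update_probabilities() after every small batch redoes work
        # for the whole vocabulary; with lazily cached, properly invalidated
        # probabilities only the words actually scored get recomputed.
        scores = []
        for training, to_classify in batches:
            for msg, is_spam in training:
                classifier.learn(tokenize(msg), is_spam)
            for msg in to_classify:
                scores.append(classifier.spamprob(tokenize(msg)))
        return scores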

- Alex


