[Spambayes] ageing out database entries

Tim Peters tim.one at comcast.net
Mon Nov 10 16:33:03 EST 2003


[Kenny Pitt]
> ...
> In the training database, both K9 and SpamBayes store only a list of
> tokens with counts of how many times each has been seen in spam and in
> ham.  No other information is stored about the original message that
> the token was seen in.  The most effective way of aging out tokens
> would seem to be to keep track of the date that each token was last
> seen, and set a threshold that says if a token has not been seen in n
> days then remove it from the training data.  Unfortunately, this adds
> a significant amount of size to the training database as well as
> increasing the amount of work to be done when classifying a message
> (thus decreasing the performance).

SpamBayes originally saved a lot more info about each token, including a
timestamp recording its most recent use in scoring.  The effect on database
size is indeed large, but the effect on processing time is minor.  At that
time, SpamBayes scored at least 80 messages/second on my home machine, and
it's slower than that now (mostly due to I/O costs and fancier, though
leaner, database schemes).
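
For illustration only, here is a minimal sketch of the per-token timestamp
idea: each token keeps its two counts plus a "last used" time, and anything
not used within n days is dropped.  The names (TokenRecord, TokenDB, touch,
expire_older_than) are hypothetical and are not the actual SpamBayes classes
or database layout.

```python
import time
from dataclasses import dataclass


@dataclass
class TokenRecord:
    # Hypothetical record: the two counts plus one extra timestamp field.
    spam_count: int = 0
    ham_count: int = 0
    last_used: float = 0.0  # seconds since the epoch


class TokenDB:
    """Toy in-memory token database (an assumption, not SpamBayes code)."""

    def __init__(self):
        self.records = {}

    def learn(self, tokens, is_spam, now=None):
        # Train on one message: bump the count of each distinct token.
        now = time.time() if now is None else now
        for tok in set(tokens):
            rec = self.records.setdefault(tok, TokenRecord())
            if is_spam:
                rec.spam_count += 1
            else:
                rec.ham_count += 1
            rec.last_used = now

    def touch(self, tokens, now=None):
        # Called while scoring: refresh the timestamp of every known token.
        now = time.time() if now is None else now
        for tok in set(tokens):
            rec = self.records.get(tok)
            if rec is not None:
                rec.last_used = now

    def expire_older_than(self, days, now=None):
        # Drop tokens not used in the last `days` days; return how many.
        now = time.time() if now is None else now
        cutoff = now - days * 86400
        stale = [t for t, r in self.records.items() if r.last_used < cutoff]
        for t in stale:
            del self.records[t]
        return len(stale)
```

The extra cost per scored message is one timestamp write per known token,
which fits the observation that the time overhead is small while the size
overhead (one more field on every record) is not.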

The extra fields were deleted because nobody had figured out a compelling
use for them.  Part of the problem in designing an expiration scheme is that
bags of words get added on a per-message basis, so they should be removed on a
per-message basis too.  Then you have to coordinate a message database with
the token database, or expand the token database to remember the bags of
tokens it was trained on.
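
A rough sketch of the second option, where the database also remembers the
bag of tokens each message was trained on so that a whole message can be
untrained later.  Everything here (MessageAwareDB, its fields and its method
names) is hypothetical and only meant to show the bookkeeping involved.

```python
from collections import Counter


class MessageAwareDB:
    """Toy token database that can expire whole messages (an assumption)."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.nspam = 0
        self.nham = 0
        # message id -> (is_spam, trained_at, frozenset of tokens)
        self.trained = {}

    def learn(self, msg_id, tokens, is_spam, trained_at):
        if msg_id in self.trained:
            raise ValueError("message already trained: %r" % (msg_id,))
        bag = frozenset(tokens)
        counts = self.spam_counts if is_spam else self.ham_counts
        for tok in bag:
            counts[tok] += 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        self.trained[msg_id] = (is_spam, trained_at, bag)

    def unlearn(self, msg_id):
        # Reverse exactly what learn() did for this one message.
        is_spam, _, bag = self.trained.pop(msg_id)
        counts = self.spam_counts if is_spam else self.ham_counts
        for tok in bag:
            counts[tok] -= 1
            if counts[tok] == 0:
                del counts[tok]
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1

    def expire_before(self, cutoff):
        # Untrain every message trained before `cutoff`; return their ids.
        old = [m for m, (_, t, _) in self.trained.items() if t < cutoff]
        for m in old:
            self.unlearn(m)
        return old
```

The `trained` mapping is where the size goes: it stores every token of every
trained message a second time, on top of the counts.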

So that's plenty of work, while most of the developers still have databases
so small that it's hard to find them on a modern disk <wink>.



