[Spambayes] How low can you go?
wsy at merl.com
Wed Dec 17 13:44:48 EST 2003
From: "Seth Goodman" <nobody at spamcop.net>
[... re aging out tokens ...]
Here's a particularly cute solution I implemented in CRM114.
The problem is that if you choose to store a token's last-seen
date, you will likely consume almost as much space in the storage
of the date as you will in the token count or the token hash.
But most tokens are hapaxes anyway. They have very low value, and you
probably will _never_ see them again.
So, when you need to clean up the database a little, go through and
decrement the "seen" count on a few (very few!) tokens.
Choose the tokens to decrement randomly. REALLY randomly. Don't
pick one chain that's too long and decrement every element in it.
Decrement only every sixteenth one, or only the ones that have
values that, when added to the system clock, have a hash with the
low order byte == 0x00, or something like that.
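
A rough sketch of the random-decrement idea in Python, for illustration
only. This is not CRM114's actual code: the name microgroom_chain, the
dict-based store, the 1/16 default and the choice to free a slot once
its count reaches zero are all assumptions made for the example.

```python
import random


def microgroom_chain(counts, chain, fraction=1 / 16, rng=random):
    """Decrement the "seen" count on a random subset of tokens in a chain.

    counts   -- dict mapping token hash -> seen count (hypothetical store)
    chain    -- iterable of token hashes that are candidates for grooming
    fraction -- probability that any given token is decremented
    rng      -- object with a random() method, so tests can pass a seeded one

    Returns the number of tokens that were decremented.
    """
    touched = 0
    # Copy the chain first so callers may pass counts.keys() safely.
    for key in list(chain):
        if key not in counts:
            continue
        # Skip most tokens; only about `fraction` of them get decremented.
        if rng.random() >= fraction:
            continue
        counts[key] -= 1
        touched += 1
        # Assumption: a token whose count reaches zero gives up its slot.
        # Hapaxes (count == 1) are the ones that disappear this way.
        if counts[key] <= 0:
            del counts[key]
    return touched
```

Calling microgroom_chain(counts, some_chain) when space gets tight
would drop roughly one hapax in sixteen from that chain, while tokens
with higher counts only lose one count and keep their slot.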
Sure, you're losing information- but that's a necessary consequence of
The net result is very fast and has an acceptable level of damage to
accuracy. Tests show that, at least for CRM114, which is HEAVILY
hapax-oriented, the damage does not increase the error rate until
you get into obscenely small databases (i.e. fewer than 100K slots).
Anyway, this is how <microgroom> is implemented in CRM114, and it
seems to work acceptably well.