[spambayes-dev] Re: [Spambayes] Database cleaning?

Matthew Dixon Cowles matt at mondoinfo.com
Mon Jun 2 15:59:57 EDT 2003

[Alex Popiel]
> [ snip of histogram showing an apparent exponential
>   dropoff in usage frequency ]
> Yes, this is a very interesting result.  I'm not sure it's actually
> useful, but it is pretty.

I'm not sure it is either, but I'm hopeful that it may be. For
example, it says that (with my mail) if a token is used in scoring,
there's a 90% chance that it will be used again within one day, a 95%
chance it will be used again within four days, and a 98% chance that
it will be used again within two weeks.

That suggests to me that a relatively simple mechanism for database
pruning may be useful. When I have a few minutes, I plan to do some
more work to see if that's true.
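
As a rough sketch of the kind of mechanism I have in mind (this is an
illustration only, not code from SpamBayes; the last_used mapping, the
function name, and the two-week cutoff are all assumptions made for
the example), one could record when each token was last used in
scoring and drop the ones that have gone unused for too long:

```python
from datetime import datetime, timedelta


def prune_stale_tokens(last_used, now, max_age=timedelta(days=14)):
    """Split tokens into those to keep and those to prune.

    last_used is a hypothetical mapping of token -> datetime of the
    last time the token was used in scoring. Tokens not used within
    max_age of now are pruned.
    """
    cutoff = now - max_age
    kept = {tok: ts for tok, ts in last_used.items() if ts >= cutoff}
    pruned = sorted(tok for tok in last_used if tok not in kept)
    return kept, pruned


# Made-up example data, not taken from a real database.
now = datetime(2003, 6, 2)
last_used = {
    "viagra": datetime(2003, 6, 1),   # used yesterday
    "python": datetime(2003, 5, 25),  # used about a week ago
    "xyzzy": datetime(2003, 4, 1),    # unused for two months
}
kept, pruned = prune_stale_tokens(last_used, now)
```

The two-week default just mirrors the 98% figure above; whether
dropping the remaining tokens hurts accuracy is exactly what would
need testing.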

> Another thing that would be interesting to plot would be a
> histogram of the average frequency each token gets used at... which
> might give us some idea of how large a DB is actually useful.

I'd be glad to poke at the data in a different way, but it's not
clear to me how that's different from what I've done. Can you tell me
a little more specifically what you mean?


