[spambayes-dev] More obvious logarithmic expiration data

Matthew Dixon Cowles matt at mondoinfo.com
Mon Jun 9 16:56:43 EDT 2003

>>2587 sets of scores processed
>>Number of scores that differ from actual score
>>by 0.00           6885
>>by 0.01 or less    633
>>by 0.10 or less    179
>>by 0.20 or less     32
>>by more than 0.20   32

[Alex Popiel]
> Are these numbers from the within-24-hours number to the
> within-30-days number, or the within-7-days number to the
> within-30-days number (given that later you say you're comparing
> against the 30-days number, not actual), or some combination of
> both?

It's a combination. Four scores were computed for each message and
the within-two-weeks, within-one-week, and within-24-hours scores
were compared to the within-30-days score. Presumably, the larger
differences are from the comparisons with the results that use the
shorter cutoffs.
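
For concreteness, here is a minimal sketch of that tally. Three comparisons per message is consistent with the table above: 2587 sets times 3 is 7761, which is the sum of the five rows. The function names, the tuple layout, and the rounding of differences to two decimal places are assumptions made for illustration, not the code that produced the numbers.

```python
from collections import Counter

# Inclusive upper bounds and labels, following the rows of the quoted table.
BUCKETS = [
    (0.00, "0.00"),
    (0.01, "0.01 or less"),
    (0.10, "0.10 or less"),
    (0.20, "0.20 or less"),
]


def bucket(diff):
    """Return the table row that a score difference falls into."""
    # Assumption: differences are compared at two decimal places.
    diff = round(abs(diff), 2)
    for bound, label in BUCKETS:
        if diff <= bound:
            return label
    return "more than 0.20"


def tally(score_sets):
    """Count differences from the 30-day score.

    score_sets is an iterable of (s30, s14, s7, s1) tuples: the scores
    computed with the 30-day, two-week, one-week, and 24-hour cutoffs.
    """
    counts = Counter()
    for s30, *shorter in score_sets:
        for score in shorter:
            counts[bucket(score - s30)] += 1
    return counts
```

Each message contributes exactly three counts, one for each of the shorter cutoffs.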

>> Judging from this data, I could relatively painlessly use a
>> database that contains only those tokens that have figured in
>> scoring in the last ten days or so. That's about 11% of the 273487
>> tokens in my database.

> Nifty.  Do you have any provision for retaining (or desire to
> retain) words that were used a lot, but suddenly go through an N+1
> day dry spell where they aren't used at all?

I've thought a bit about that. It might be useful to bias the delete
function so that a token is allowed to go unused for a longer period
before it's deleted, with that period growing as a function of
hamcount+spamcount. Some more work could determine whether that's a
valuable strategy.
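
One way such a bias could look, purely as a sketch: let the number of idle days a token is allowed grow with the logarithm of hamcount+spamcount. The function names, the ten-day base, and the choice of a logarithm are all assumptions for illustration. Nothing here has been tested against real data.

```python
import math


def max_idle_days(hamcount, spamcount, base_days=10.0):
    """Idle days allowed before a token may be deleted.

    Hypothetical rule: a token never seen in training gets base_days;
    each tenfold increase in (1 + hamcount + spamcount) adds another
    base_days of grace.
    """
    return base_days * (1.0 + math.log10(1 + hamcount + spamcount))


def should_expire(days_since_used, hamcount, spamcount, base_days=10.0):
    """True if the token has been idle longer than its allowance."""
    return days_since_used > max_idle_days(hamcount, spamcount, base_days)
```

Under this rule a token with hamcount+spamcount of 9 survives a 20-day dry spell, while a token with no counts is dropped after 10 days.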

>>And, of course, it's not really time that counts but rather the
>>number of emails seen.
> I'm not so convinced of this.  One of the things we're dealing
> with is spam mutation rate, which I believe is independent of
> how much mail any one person receives.

I agree, but I meant something simpler than that. If I were on
vacation for two weeks and therefore hadn't scored any messages in
that time, it wouldn't make sense to expire my entire database.
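
A minimal sketch of that idea, assuming expiration is driven by a count of messages scored rather than by the calendar. The class and method names are hypothetical and are not part of spambayes.

```python
class MessageClock:
    """Track token use by message count instead of wall-clock time."""

    def __init__(self):
        self.messages_seen = 0
        # token -> value of messages_seen when the token last figured
        # in scoring
        self.last_used = {}

    def score_message(self, tokens):
        """Record that a message containing these tokens was scored."""
        self.messages_seen += 1
        for token in tokens:
            self.last_used[token] = self.messages_seen

    def expire(self, max_idle_messages):
        """Drop tokens last used more than max_idle_messages messages
        before the most recent one, and return them."""
        cutoff = self.messages_seen - max_idle_messages
        stale = [t for t, seen in self.last_used.items() if seen < cutoff]
        for token in stale:
            del self.last_used[token]
        return stale
```

Because the clock only advances when a message is scored, two weeks without mail expires nothing: calling expire() twice in a row with no messages in between returns an empty list the second time.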


