[spambayes-dev] More obvious logarithmic expiration data

Matthew Dixon Cowles matt at mondoinfo.com
Mon Jun 9 16:56:43 EDT 2003


[me]
>>2587 sets of scores processed
>>Number of scores that differ from actual score
>>by 0.00           6885
>>by 0.01 or less    633
>>by 0.10 or less    179
>>by 0.20 or less     32
>>by more than 0.20   32

[Alex Popiel]
> Are these numbers from the within-24-hours number to the
> within-30-days number, or the within-7-days number to the
> within-30-days number (given that later you say you're comparing
> against the 30-days number, not actual), or some combination of
> both?

It's a combination. Four scores were computed for each message and
the within-two-weeks, within-one-week, and within-24-hours scores
were compared to the within-30-days score. Presumably, the larger
differences are from the comparisons with the results that use the
shorter cutoffs.
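
For anyone wanting to reproduce that kind of tally, here is a minimal
sketch of the comparison described above. It is not the script that
produced the table: the function names, the tuple layout, and the
rounding of differences to two decimals are all assumptions made for
illustration. Three comparisons per set of scores is consistent with
the table, whose counts sum to 7761 = 3 * 2587.

```python
from collections import Counter

# Hypothetical bucket limits and labels, chosen to mirror the quoted table.
BUCKETS = [
    (0.00, "by 0.00"),
    (0.01, "by 0.01 or less"),
    (0.10, "by 0.10 or less"),
    (0.20, "by 0.20 or less"),
]


def bucket_for(diff):
    """Return the label of the first bucket whose limit covers diff."""
    for limit, label in BUCKETS:
        if diff <= limit:
            return label
    return "by more than 0.20"


def tally_differences(score_sets):
    """Count how far each shorter-cutoff score is from the 30-day score.

    score_sets is an iterable of tuples laid out (by assumption) as
    (within_30_days, within_two_weeks, within_one_week, within_24_hours).
    """
    counts = Counter()
    for baseline, *shorter in score_sets:
        for score in shorter:
            # Assumption: scores are compared at two-decimal precision.
            diff = round(abs(score - baseline), 2)
            counts[bucket_for(diff)] += 1
    return counts
```

Calling tally_differences() on a list of such tuples returns a Counter
keyed by the bucket labels, with three entries counted per message.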

>> Judging from this data, I could relatively painlessly use a
>> database that contains only those tokens that have figured in
>> scoring in the last ten days or so. That's about 11% of the 273487
>> tokens in my database.

> Nifty.  Do you have any provision for retaining (or desire to
> retain) words that were used a lot, but suddenly go through an N+1
> day dry spell where they aren't used at all?

I've thought a bit about that. It might be useful to bias the delete
function so that a token with a large hamcount+spamcount is retained
through a longer unused period than a rarely seen one. Some more work
could determine whether that's a valuable strategy.
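
One way such a bias could look, purely as a sketch: let the grace
period before deletion grow with the logarithm of hamcount+spamcount.
The logarithmic form, the ten-day base, and the function names below
are assumptions for illustration, not anything that has been tested
against real mail.

```python
import math


def retention_days(hamcount, spamcount, base_days=10.0):
    """Days an unused token is kept before it may be deleted.

    Hypothetical rule: a token never seen in training gets base_days,
    and each tenfold increase in (1 + hamcount + spamcount) adds
    another base_days.
    """
    total = hamcount + spamcount
    return base_days * (1.0 + math.log10(1 + total))


def should_expire(days_since_used, hamcount, spamcount, base_days=10.0):
    """True if the token has been unused for longer than its grace period."""
    return days_since_used > retention_days(hamcount, spamcount, base_days)
```

Under this rule a token with hamcount+spamcount of 9 survives a dry
spell about twice as long as a token with a count of 0. Whether that
improves scores is exactly the open question.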

>>And, of course, it's not really time that counts but rather the
>>number of emails seen.
> 
> I'm not so convinced of this.  One of the things we're dealing
> with is spam mutation rate, which I believe is independent of
> how much mail any one person receives.

I agree, but I meant something simpler than that. If I were on
vacation for two weeks and therefore hadn't scored any messages in
that time, it wouldn't make sense to expire my entire database.
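
A minimal sketch of that simpler point, assuming token age is measured
in messages scored rather than in days. The class and method names are
hypothetical and do not correspond to anything in the spambayes code.

```python
class MessageClock:
    """Track when each token was last used, counted in messages scored."""

    def __init__(self):
        self.messages_seen = 0
        # token -> value of messages_seen when the token last figured
        # in scoring
        self.last_used = {}

    def score_message(self, tokens):
        """Advance the clock by one message and mark its tokens as used."""
        self.messages_seen += 1
        for token in tokens:
            self.last_used[token] = self.messages_seen

    def expired_tokens(self, max_age_messages):
        """Tokens last used at least max_age_messages messages ago."""
        cutoff = self.messages_seen - max_age_messages
        return sorted(
            token for token, seen in self.last_used.items() if seen <= cutoff
        )
```

Because the clock only advances when a message is scored, two weeks
with no mail leaves every token's age unchanged and nothing expires.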

Regards,
Matt



