[Spambayes] expiration ideas.

Anthony Baxter anthony@interlink.com.au
Mon Oct 21 07:06:53 2002


>>> Tim Peters wrote
> OTOH, what's the purpose of expiration?  I can think of two:
> 
> 1. To reduce database size.
> 
> 2. To accelerate adaptation to changes in ham and/or spam.

The former. I'm trying to think about how this could be deployed "in 
the real world". 

Note also that I'm not so much worried about adapting to spam as 
adapting to changing ham patterns. I know that my own email changes 
over time. For instance, until this project started, I doubt the word 
"Nigerian" would have been considered a strong ham indicator for me :)

(Somewhat off-topic, but related: I also suspect that if the spambayes 
code is vulnerable to being deliberately sabotaged, it'll be the 
tokeniser that's the weak point, not the classifier. For instance, 
I already have a couple of persistent FNs (false negatives) whose 
message bodies are entirely encoded in JavaScript. I don't want to 
think about having to decode JavaScript, or run it, to check whether 
something's spam.)

I'm somewhat nervous of the "purge all unique words" approach. One
obvious failing is that if you _are_ doing ongoing training, you'd
have to batch up a bunch of messages before each purge; otherwise a
new word gets thrown away before it can ever be seen a second time.
I'm also not sure that deliberately perverting the real world in that
way isn't going against the "stupid beats smart" meta-rule that's
served us well so far...
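
To make the batching worry concrete, here's a minimal sketch. It is not
the spambayes database or its API: the names train() and purge_hapaxes()
are made up for illustration, and a plain Counter of "how many messages
contained this token" stands in for the real per-token ham/spam counts.
It only shows why purging after every single message can never let a
new word accumulate, while purging after a batch can.

```python
from collections import Counter

def train(counts, tokens):
    """Hypothetical helper: count each distinct token once per message."""
    counts.update(set(tokens))

def purge_hapaxes(counts):
    """Hypothetical helper: drop every token seen in exactly one message."""
    for tok in [t for t, n in counts.items() if n == 1]:
        del counts[tok]

# Toy, already-tokenised messages (made-up data).
messages = [
    ["nigerian", "bank", "transfer"],
    ["nigerian", "spambayes", "tokeniser"],
]

# Purge after every message: each new word has count 1 and is dropped
# before the next message arrives, so nothing ever survives.
eager = Counter()
for msg in messages:
    train(eager, msg)
    purge_hapaxes(eager)

# Purge once after the whole batch: words repeated within the batch
# reach count 2 and are kept.
batched = Counter()
for msg in messages:
    train(batched, msg)
purge_hapaxes(batched)

print("eager:", dict(eager))
print("batched:", dict(batched))
```

With the eager schedule the table ends up empty; with the batched one
only "nigerian" is left, since it's the only word that turns up in both
messages.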

but-then-maybe-stupider-beats-stupid, too.
Anthony.