[spambayes-dev] Another incremental training idea...
nobody at spamcop.net
Tue Jan 13 18:40:29 EST 2004
> It occurs to me that we need to start being careful about how we talk
> about expiry. The expiry that I've tested with the harness is based on
> taking trained messages back out of the database after a certain length
> of time. However, in real life usage, I'm completely rebuilding the
> database every night with a 4 month horizon (and likely training on a
> noticeably different collection of messages each night).
I guess I don't understand why the two expiry approaches should give
different results, unless individual messages expire at the precise time of
day exactly 120 days after they were trained rather than all at once at
12:00:01 AM. I would expect the differences to be rather small.
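To make the comparison concrete, here is a minimal sketch of the two
approaches as I understand them. It is not the test harness or any SpamBayes
code: the names (HORIZON, incremental_expiry, nightly_rebuild) and the
one-message-per-day archive are made up for illustration, and it tracks only
which messages are trained, not token counts. It also assumes both approaches
draw on the same collection of messages and expire at day granularity.

```python
from datetime import date, timedelta

# Assumed stand-in for the "4 month horizon"; 120 days is taken from the text.
HORIZON = timedelta(days=120)

def incremental_expiry(trained, today):
    """Untrain in place: drop every message older than the horizon."""
    expired = [m for m, d in trained.items() if today - d > HORIZON]
    for msg_id in expired:
        del trained[msg_id]
    return trained

def nightly_rebuild(archive, today):
    """Rebuild from scratch using only messages inside the horizon."""
    return {m: d for m, d in archive.items()
            if d <= today and today - d <= HORIZON}

# Hypothetical archive: one message per day for 200 days.
start = date(2003, 6, 1)
archive = {f"msg{i}": start + timedelta(days=i) for i in range(200)}

trained = {}
for i in range(200):
    today = start + timedelta(days=i)
    trained[f"msg{i}"] = today            # train on today's mail
    incremental_expiry(trained, today)    # expire once per night
    rebuilt = nightly_rebuild(archive, today)
    assert trained == rebuilt             # same training set every night
```

Under those assumptions the two never diverge. If the nightly rebuild picks a
different collection of messages each night, that assumption fails and the
results could differ.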
If the four-month expiry degrades performance, as your data shows, would
a longer expiry do better? I am at a bit of a loss, since we can't keep
adding to the training database forever. At some point, which may be
different for every mail stream, I am guessing that very old messages no
longer contribute as much as the newer ones to accurate classification.
Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com
Spambots: disregard the above