[spambayes-dev] Another incremental training idea...

T. Alexander Popiel popiel at wolfskeep.com
Tue Jan 13 19:13:56 EST 2004


In message:  <MHEGIFHMACFNNIMMBACACEFAHDAA.nobody at spamcop.net>
             "Seth Goodman" <nobody at spamcop.net> writes:
>[Alex Popiel]
>> It occurs to me that we need to start being careful about how we talk
>> about expiry.  The expiry that I've tested with the harness is based on
>> taking trained messages back out of the database after a certain length
>> of time.  However, in real life usage, I'm completely rebuilding the
>> database every night with a 4 month horizon (and likely training on a
>> noticeably different collection of messages each night).
>
>I guess I don't understand why the two expiry approaches should be
>different, unless the individual messages expired at precise times of the
>day exactly 120 days after they were trained rather than all at once at
>12:00:01 AM.

The two methods differ because in the 'take stuff out of the database'
method, the selection of messages trained from day 2 doesn't change: it
remains affected by which messages were trained on day 1, even after
day 1 has been taken out.  In the 'rebuild from scratch' method, by
contrast, the selection of messages trained from day 2 (potentially)
changes when day 1 disappears over the horizon, and the scores for the
day 2 messages are presumably closer to .5 as a result.
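
To make the difference concrete, here's a minimal sketch of the two
styles in Python.  ToyClassifier, should_train(), and the message dicts
are illustrative stand-ins, not the actual spambayes classes or
training code:

    from collections import Counter
    from datetime import timedelta

    HORIZON = timedelta(days=120)

    class ToyClassifier:
        # Trivial token-counting stand-in; not the real spambayes classifier.
        def __init__(self):
            self.spam = Counter()
            self.ham = Counter()

        def train(self, msg, is_spam):
            (self.spam if is_spam else self.ham).update(msg["tokens"])

        def untrain(self, msg, is_spam):
            (self.spam if is_spam else self.ham).subtract(msg["tokens"])

    def expire_incrementally(classifier, trained, today):
        # Method 1: pull old messages back out of the live database.
        # The selection of later messages that were trained is frozen;
        # it was decided while the now-expired messages were still present.
        for msg in list(trained):
            if today - msg["date"] > HORIZON:
                classifier.untrain(msg, msg["is_spam"])
                trained.remove(msg)

    def rebuild_from_scratch(all_messages, should_train, today):
        # Method 2: start empty every night and replay only messages
        # inside the horizon.  Whether a later message gets trained is
        # re-decided *without* the expired days in the database, so the
        # selection (and hence the scores) can drift from night to night.
        classifier = ToyClassifier()
        trained = []
        for msg in sorted(all_messages, key=lambda m: m["date"]):
            if today - msg["date"] <= HORIZON and should_train(classifier, msg):
                classifier.train(msg, msg["is_spam"])
                trained.append(msg)
        return classifier, trained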

>I would think the differences to be rather small.

For messages at the later end of the window, the differences are
probably small, but at the earlier end of the window they are likely
to be profound.

>If the four-month expiry degrades the performance, as your data shows, would
>a longer expiry do better?  I am at a bit of a loss, since we can't keep
>adding to the training database forever.  At some point, and that might be
>different for every mail stream, I am guessing that very old messages are no
>longer contributing as much as the newer ones to accurate classification.
>No?

This is an open question, and I don't think we even have a concept of
how to measure which _messages_ in a database are contributing more than
others.  I suppose you could do a 2-d scatter plot, where one axis is
the ordinal of the message being classified and the other axis is the
ordinal of any trained message which contained a token that was used in
the classification... lots of tiny dots, and see whether they're evenly
spread through the triangle or biased to one side or the other...
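
For what it's worth, a rough sketch of that plot, assuming we could
record which trained messages supplied the tokens actually used in each
classification (the classifications and token_sources structures here
are hypothetical bookkeeping, not something the current code produces
as far as I know):

    import matplotlib.pyplot as plt

    def contribution_scatter(classifications, token_sources):
        # classifications: list of (classified_ordinal, tokens_used) pairs
        # token_sources:   dict mapping token -> ordinals of the trained
        #                  messages that contained it
        xs, ys = [], []
        for classified_ordinal, tokens_used in classifications:
            for tok in tokens_used:
                for trained_ordinal in token_sources.get(tok, ()):
                    xs.append(classified_ordinal)
                    ys.append(trained_ordinal)
        plt.scatter(xs, ys, s=1)  # lots of tiny dots
        plt.xlabel("ordinal of message being classified")
        plt.ylabel("ordinal of trained message supplying a token")
        plt.show()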

Why can't we just keep adding to the database forever?  My mail is
accumulating much more slowly than Moore's law, even with the exponential
growth in spam...  I can't imagine the DB growing faster than the dataset
it's based on.

- Alex


