[Spambayes] expiration ideas.

T. Alexander Popiel popiel@wolfskeep.com
Sun Oct 20 17:52:28 2002


In message:  <2124500893-BeMail@CR593174-A>
             "Alexander G. M. Smith" <agmsmith@rogers.com> writes:
>Anthony Baxter wrote:
>>   Keep the "interim" wordinfo around (gzipped, datestamped) until your
>>   expiration time is up - then undo the earlier merge, subtracting
>>   the spamcount/hamcounts. 
>> 
>> Thoughts? Unless there's a screamingly obvious "don't be stupid" I'll
>> play with this tomorrow (ah, leave....)
>
>Sounds reasonable.  But I'd rather keep around the whole messages so
>that I can change tokenizing schemes.  Or perhaps use one of those
>future inter-word relation schemes.

Whether you want to keep whole messages or just the wordlists
depends entirely on whether you want to fully retrain when you
switch tokenization schemes vs. keeping the old database and
just adding new stuff with the new tokenization.

If you keep the database through tokenizations, then you want a
record of what actually got added during a prior training,
instead of what would have been added if the current tokenization
were used.  Thus, the word lists are better for database integrity.
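
A minimal sketch of the word-list approach, assuming the database is a
plain dict mapping each token to a (spamcount, hamcount) pair.  The
names merge and unmerge are hypothetical and this is not the actual
Spambayes wordinfo code; it only illustrates why the saved list must be
the one that was really merged:

```python
# Hypothetical sketch: db and each saved batch map token -> (spamcount, hamcount).

def merge(db, batch):
    """Add a batch's recorded counts into the database."""
    for token, (spam, ham) in batch.items():
        cur_spam, cur_ham = db.get(token, (0, 0))
        db[token] = (cur_spam + spam, cur_ham + ham)

def unmerge(db, batch):
    """Expire a batch by subtracting exactly the counts that were recorded."""
    for token, (spam, ham) in batch.items():
        cur_spam, cur_ham = db.get(token, (0, 0))
        remaining = (cur_spam - spam, cur_ham - ham)
        if remaining == (0, 0):
            db.pop(token, None)   # token no longer backed by any training
        else:
            db[token] = remaining

db = {}
old_batch = {"viagra": (3, 0), "meeting": (0, 2)}          # saved under the old tokenizer
new_batch = {"viagra": (1, 0), "subject:meeting": (0, 1)}  # saved under the new tokenizer
merge(db, old_batch)
merge(db, new_batch)
unmerge(db, old_batch)   # the old batch's expiration time is up
```

Because unmerge subtracts the saved list rather than re-tokenizing the
old messages, the database ends up holding only the new batch, even
though the two batches were produced by different tokenizers.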

Of course, if you fully retrain every time you switch tokenizers,
then keeping the entire messages is the only way to support
arbitrary changes in the tokenizer.
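
For contrast, a rough sketch of the full-retrain approach, assuming the
whole messages are kept as (text, is_spam) pairs.  The two tokenizers
and the retrain helper are made up for illustration and are much
simpler than anything the real project uses:

```python
# Hypothetical sketch: rebuild the whole database from the kept messages.

def split_words(text):
    """Toy tokenizer: lowercase, whitespace-separated words."""
    return text.lower().split()

def word_pairs(text):
    """Toy inter-word tokenizer: adjacent word pairs."""
    words = text.lower().split()
    return [a + " " + b for a, b in zip(words, words[1:])]

def retrain(messages, tokenize):
    """Build token -> (spamcount, hamcount) from scratch with the given tokenizer."""
    db = {}
    for text, is_spam in messages:
        for token in set(tokenize(text)):   # count each token once per message
            spam, ham = db.get(token, (0, 0))
            db[token] = (spam + 1, ham) if is_spam else (spam, ham + 1)
    return db

messages = [("Buy cheap pills", True), ("Lunch meeting today", False)]
db_words = retrain(messages, split_words)
db_pairs = retrain(messages, word_pairs)   # same messages, different tokenizer
```

Nothing from the old tokenization survives a retrain, so there is no
mixed-scheme database to keep consistent; the cost is storing and
reprocessing every message.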

It's a question of approach...

Personally, I'm keeping all messages for all time, so it doesn't
matter much one way or another.

- Alex

PS. We can really confuse folks if Alex and Alex start holding
    regular debates on the list...