[Spambayes] Back to language issue (long)
skip at pobox.com
Sat Mar 29 22:31:00 EST 2003
TimP> but do have a subtler effect: they bloat the database size.
TimS> If I recall correctly, single occurrence words are called hapaxes,
TimS> right? We've talked about aging before, but it seems like it
TimS> would be clearly a good thing to age hapaxes. After a while, ALL
TimS> they will do is bloat the database, which is arguably a bad thing.
I retrain on my entire saved email collection periodically. After a full
retrain, I delete all hapaxes (well, I copy the database except for the
hapaxes it contains). It cuts the database size roughly in half, and if,
after adding more messages, those tokens are no longer hapaxes, they will be
kept after the next retrain.
Seems to work for me.
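
For anyone who wants to try the same thing, here is a rough sketch of the copy-except-hapaxes step. It assumes the database can be viewed as a plain dict mapping each token to a (spam_count, ham_count) pair. That layout and the name copy_without_hapaxes are made up for illustration and are not the actual Spambayes storage API, so adapt it to whatever your classifier really stores.

```python
# Hypothetical stand-in for a token database: token -> (spam_count, ham_count).
# This is NOT the Spambayes storage API, just a plain dict for illustration.

def copy_without_hapaxes(db):
    """Return a copy of db that omits tokens seen in exactly one message."""
    return {
        token: (spam, ham)
        for token, (spam, ham) in db.items()
        if spam + ham > 1
    }


trained = {
    "viagra": (12, 0),
    "python": (0, 30),
    "xyzzy42": (1, 0),   # hapax: seen once, in spam
    "frobnitz": (0, 1),  # hapax: seen once, in ham
}

pruned = copy_without_hapaxes(trained)
print(sorted(pruned))
```

The original dict is left alone, which matches copying the database rather than deleting in place. A token dropped this way comes back on the next full retrain, and survives the following prune once it has shown up in more than one message.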