[Spambayes] Back to language issue (long)

Skip Montanaro skip at pobox.com
Sat Mar 29 22:31:00 EST 2003

    TimP> but do have a subtler effect:  they bloat the database size.

    TimS> If I recall correctly, single-occurrence words are called hapaxes,
    TimS> right?  We've talked about aging before, but it seems like it
    TimS> would be clearly a good thing to age hapaxes.  After a while, ALL
    TimS> they will do is bloat the database, which is arguably a bad thing.

I retrain on my entire saved email collection periodically.  After a full
retrain, I delete all hapaxes (well, I copy the database except for the
hapaxes it contains).  That cuts the database size roughly in half.  If a
token turns up again in messages added later, it is no longer a hapax and
will be kept after the next retrain.

Seems to work for me.
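
A minimal sketch of the copy-except-hapaxes step described above.  It is
not Spambayes' actual storage code: the database is assumed here to be a
plain dict mapping each token to a (spam_count, ham_count) pair, and
copy_without_hapaxes is a hypothetical helper name.

```python
def copy_without_hapaxes(counts):
    """Return a new token table, leaving out tokens seen exactly once.

    `counts` is an assumed layout: {token: (spam_count, ham_count)}.
    The original table is not modified.
    """
    return {
        token: (spam, ham)
        for token, (spam, ham) in counts.items()
        if spam + ham > 1  # a hapax has a total count of exactly 1
    }


# Made-up counts, standing in for the result of a full retrain.
counts = {
    "viagra": (12, 0),
    "python": (0, 30),
    "xyzzy": (1, 0),    # hapax: seen once, in spam
    "plugh": (0, 1),    # hapax: seen once, in ham
    "meeting": (1, 1),  # seen twice in total, so kept
}

pruned = copy_without_hapaxes(counts)
print(f"{len(counts)} tokens before, {len(pruned)} after")
```

Because every full retrain recounts from scratch, a token dropped this way
comes back on its own once it shows up in a second message.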


