[Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Wed Dec 17 12:39:54 EST 2003


[Skip Montanaro]
> Size definitely does matter. <wink> With both bigrams and my set/used
> timestamps (datetime objects), the size of the database ballooned.  I
> think the set timestamp could be dispensed with and the last used
> timestamp converted to something smaller, like a YYYYMMDD string.

A small integer should be enough for last-used, like the number of days
between the day the database was first created and the day a feature was
most recently used in scoring.  That's easily computed, easy to use *in*
computations, and consumes no more than 3 bytes in a binary pickle (proto 1
or proto 2) until about 180 years after the database was created <wink>.
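
A minimal sketch of that day-offset encoding, to check the size claim. The
helper names (days_since, pickled_int_size) and the creation date are made
up for illustration and are not part of the Spambayes code. It uses
protocol 1 and drops the 1-byte STOP opcode that ends every pickle, so the
result is just the opcode plus argument for the integer itself.

```python
import pickle
from datetime import date


def days_since(created: date, used: date) -> int:
    """Whole days between database creation and the most recent use."""
    return (used - created).days


def pickled_int_size(n: int) -> int:
    """Bytes a non-negative int costs inside a protocol 1 pickle.

    pickle.dumps() appends a 1-byte STOP opcode; subtract it so only the
    integer's own opcode + argument is counted.
    """
    return len(pickle.dumps(n, protocol=1)) - 1


# Hypothetical creation date, used only for this example.
created = date(2003, 12, 17)
last_used = days_since(created, date(2004, 1, 16))

# 0..255 costs 2 bytes, 256..65535 costs 3 bytes, larger values cost 5.
sizes = {n: pickled_int_size(n) for n in (last_used, 255, 256, 65535, 65536)}

# 65535 days is where the 3-byte encoding runs out.
years_until_growth = 65535 / 365.25
```

The 3-byte form covers offsets up to 65535 days, which is where the "about
180 years" figure comes from.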

Especially with the bigram scheme, which creates a relatively enormous
number of hapaxes, I expect the best use for a per-feature "last used"
timestamp is to expire hapaxes that haven't been used in scoring for N days.
That should yield major size savings, actually increase resistance to
"spectacular failures" (which so far most often seem to be associated with
hitting a large number of old hapaxes from "the other" category), and
*probably* not hurt anything else.  Expiring "near hapaxes" too gets dicier,
and more so the more liberal the conception of "near".
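
A rough sketch of such an expiry pass, under stated assumptions: the
database is modeled as a plain dict, FeatureRecord and expire_hapaxes are
hypothetical names, and a hapax is taken to be a feature whose spam and ham
counts sum to exactly 1. This is not the actual Spambayes storage layout.

```python
from typing import Dict, NamedTuple


class FeatureRecord(NamedTuple):
    """Hypothetical per-feature record; not the real Spambayes structure."""
    spamcount: int
    hamcount: int
    last_used: int  # days since the database was created


def expire_hapaxes(db: Dict[str, FeatureRecord], today: int,
                   max_idle_days: int) -> int:
    """Delete hapaxes not used in scoring for more than max_idle_days.

    Returns the number of features removed. Features seen more than once
    are left alone, however stale they are.
    """
    stale = [
        feature
        for feature, rec in db.items()
        if rec.spamcount + rec.hamcount == 1
        and today - rec.last_used > max_idle_days
    ]
    for feature in stale:
        del db[feature]
    return len(stale)


# Toy data: day numbers are offsets from database creation.
db = {
    "bi:cheap meds": FeatureRecord(1, 0, last_used=3),     # old hapax
    "bi:python list": FeatureRecord(0, 1, last_used=95),   # recent hapax
    "viagra": FeatureRecord(40, 0, last_used=2),           # old, not a hapax
}
removed = expire_hapaxes(db, today=100, max_idle_days=30)
```

Extending the test to "near hapaxes" would mean loosening the count check
(say, a total of 2 or 3), which is exactly the part that gets dicier.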



