[spambayes-dev] RE: [Spambayes] How low can you go?

Justin Mason jm at jmason.org
Wed Dec 17 13:59:02 EST 2003


Tim Peters writes:
> [Skip Montanaro]
> > Size definitely does matter. <wink> With both bigrams and my set/used
> > timestamps (datetime objects), the size of the database ballooned.  I
> > think the set timestamp could be dispensed with and the last used
> > timestamp converted to something smaller, like a YYYYMMDD string.
> A small integer should be enough for last-used, like the number of days
> between the day the database was first created and the day a feature was
> most recently used in scoring.  That's easily computed, easy to use *in*
> computations, and consumes no more than 3 bytes in a binary pickle (proto 1
> or proto 2) until about 180 years after the database was created <wink>.
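
As a rough check of the size claim quoted above, here is a minimal Python sketch. The helper names `days_since_creation` and `int_payload_size` are made up for illustration and are not part of spambayes. It measures how many bytes a day count occupies in a protocol-1 pickle (opcode plus argument, excluding the trailing STOP opcode).

```python
import pickle
from datetime import date


def days_since_creation(created: date, last_used: date) -> int:
    """Whole days between database creation and a feature's last use."""
    return (last_used - created).days


def int_payload_size(n: int) -> int:
    """Bytes an int occupies in a protocol-1 pickle: opcode + argument.

    The trailing STOP opcode that ends every pickle is subtracted.
    """
    return len(pickle.dumps(n, 1)) - 1


if __name__ == "__main__":
    n = days_since_creation(date(2003, 1, 1), date(2003, 12, 17))
    print(n, int_payload_size(n))
    # 65535 days is the largest value that still fits the 2-byte form.
    print(65535 // 365, int_payload_size(65535), int_payload_size(65536))
```

Values up to 65535 use the two-byte-argument opcode, which is where the "about 180 years" figure comes from (65535 days is roughly 179 years).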

FWIW -- in SpamAssassin, we used to use an approximate scheme that fit the
remaining UNIX epoch into 2 bytes something like you're suggesting (by dividing
time_t by several hours and starting the current epoch from 1 Jan 2000, or
something like that).

However, we found that we ran into expiry problems for large dbs and busy
sites, because that just didn't give us enough precision -- having a
granularity of hours wasn't good enough.  So SpamAssassin db version 2 now
just uses a plain old long containing a time_t value, and damn the db
bloat.  A bit bigger, but expiry now works reliably ;)
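
To illustrate the precision trade-off, here is a minimal Python sketch of a coarse 2-byte timestamp next to a full 4-byte time_t. The 6-hour granularity, the exact 1 Jan 2000 base, and all names below are assumptions chosen for illustration; the post only says "several hours" and "something like that", so this is not the actual old SpamAssassin encoding.

```python
import calendar
import struct

# Assumed base: 1 Jan 2000 00:00:00 UTC as seconds since the UNIX epoch.
EPOCH_2000 = calendar.timegm((2000, 1, 1, 0, 0, 0))
# Assumed granularity; the post only says "several hours".
GRANULARITY = 6 * 3600


def coarse_pack(t: int) -> bytes:
    """Pack a time_t into 2 bytes by counting GRANULARITY-sized steps."""
    return struct.pack("<H", (t - EPOCH_2000) // GRANULARITY)


def coarse_unpack(data: bytes) -> int:
    """Recover the start of the step; the offset within it is lost."""
    (steps,) = struct.unpack("<H", data)
    return EPOCH_2000 + steps * GRANULARITY


def full_pack(t: int) -> bytes:
    """Pack a time_t as a 4-byte little-endian unsigned long."""
    return struct.pack("<L", t)


if __name__ == "__main__":
    t1 = EPOCH_2000 + 1000
    t2 = EPOCH_2000 + 20000  # more than five hours later
    print(coarse_pack(t1) == coarse_pack(t2))  # same 2-byte value
    print(full_pack(t1) == full_pack(t2))      # distinct 4-byte values
```

Under these assumptions, two tokens touched more than five hours apart get the same stored atime, which is the kind of ambiguity that makes expiry unreliable on a busy site.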

However a good way we found to cut down hapax db bloat was to use a
polymorphic format for the tokens in the db; if a token has spamcount < 8
and hamcount < 8, it's marshalled so that the spamcount and hamcount are
both shoved into 1 byte as a bitmask, with the high bits set.

Here's the Perl code in question:

  sub tok_pack {
    my ($self, $ts, $th, $atime) = @_;
    $ts ||= 0; $th ||= 0; $atime ||= 0;
    if ($ts < 8 && $th < 8) {
      return pack ("CV", ONE_BYTE_FORMAT | ($ts << 3) | $th, $atime);
    } else {
      return pack ("CVVV", TWO_LONGS_FORMAT, $ts, $th, $atime);
    }
  }
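
For readers on the Python side, here is a hedged sketch of the same packing idea together with a matching unpack. The constant values (0xC0 for the one-byte format and its flag mask, 0x00 for the two-longs format) and the `tok_unpack` function are assumptions for illustration; the post only says the high bits are set and does not show the constants or the decoder.

```python
import struct

# Assumed constant values; the post does not show their definitions.
FORMAT_FLAG = 0xC0       # mask selecting the two high "format" bits
ONE_BYTE_FORMAT = 0xC0   # both counts packed into the low 6 bits
TWO_LONGS_FORMAT = 0x00  # counts stored as two 32-bit values


def tok_pack(ts: int, th: int, atime: int) -> bytes:
    """Pack spam count, ham count and access time for one token."""
    if ts < 8 and th < 8:
        # 1 byte for flag + both counts, then 4 bytes of atime.
        return struct.pack("<BL", ONE_BYTE_FORMAT | (ts << 3) | th, atime)
    # 1 flag byte, then three 32-bit little-endian values.
    return struct.pack("<BLLL", TWO_LONGS_FORMAT, ts, th, atime)


def tok_unpack(value: bytes) -> tuple:
    """Inverse of tok_pack; returns (ts, th, atime)."""
    fmt = value[0] & FORMAT_FLAG
    if fmt == ONE_BYTE_FORMAT:
        packed, atime = struct.unpack("<BL", value)
        return (packed >> 3) & 0x07, packed & 0x07, atime
    if fmt == TWO_LONGS_FORMAT:
        _, ts, th, atime = struct.unpack("<BLLL", value)
        return ts, th, atime
    raise ValueError("unknown token format")


if __name__ == "__main__":
    hapax = tok_pack(1, 0, 1071687542)
    busy = tok_pack(120, 3, 1071687542)
    print(len(hapax), len(busy))
    print(tok_unpack(hapax), tok_unpack(busy))
```

With this layout a hapax costs 5 bytes instead of 13, and since hapaxes make up much of the db, that is where the saving comes from.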

I do like Bill Y's "sunspots expiry" scheme though ;)

--j.

