[spambayes-dev] RE: [Spambayes] How low can you go?
Justin Mason
jm at jmason.org
Wed Dec 17 13:59:02 EST 2003
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Tim Peters writes:
> [Skip Montanaro]
> > Size definitely does matter. <wink> With both bigrams and my set/used
> > timestamps (datetime objects), the size of the database ballooned. I
> > think the set timestamp could be dispensed with and the last used
> > timestamp converted to something smaller, like a YYYYMMDD string.
>
> A small integer should be enough for last-used, like the number of days
> between the day the database was first created and the day a feature was
> most recently used in scoring. That's easily computed, easy to use *in*
> computations, and consumes no more than 3 bytes in a binary pickle (proto 1
> or proto 2) until about 180 years after the database was created <wink>.
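The "3 bytes until about 180 years" figure can be checked with a few lines of Python. This is only a sketch: `days_since` and `payload_size` are hypothetical helper names, and the sizes count just the integer opcode and its argument, not the pickle's framing bytes.

```python
import pickle
from datetime import date


def days_since(created: date, used: date) -> int:
    """Days between database creation and the day a feature was last used."""
    return (used - created).days


def payload_size(n: int, protocol: int) -> int:
    """Bytes the integer itself occupies in a pickle (opcode + argument).

    A standalone pickle also carries a 1-byte STOP opcode and, for
    protocol 2 and later, a 2-byte PROTO header; both are subtracted here.
    """
    data = pickle.dumps(n, protocol)
    overhead = 1 + (2 if protocol >= 2 else 0)
    return len(data) - overhead


if __name__ == "__main__":
    created = date(2003, 12, 17)
    last_used = days_since(created, date(2004, 3, 1))
    for proto in (1, 2):
        print(proto, last_used, payload_size(last_used, proto))
        # 65535 days is roughly 179 years; one day more needs a wider opcode.
        print(proto, payload_size(65535, proto), payload_size(65536, proto))
```

Day counts below 256 fit in 2 bytes and counts below 65536 in 3; beyond that the pickler falls back to a 4-byte integer argument.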
FWIW -- in SpamAssassin, we used to use an approximate scheme, something
like what you're suggesting, that fit the remaining UNIX epoch into 2 bytes
(by dividing time_t by several hours and starting the epoch from 1 Jan 2000,
or something like that).
However, we found that we ran into expiry problems for large dbs and busy
sites, because that just didn't give us enough precision -- a granularity
of hours wasn't good enough. So SpamAssassin db version 2 now just uses a
plain old long containing a time_t value, and damn the db bloat. A bit
bigger, but expiry now works reliably ;)
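To make the trade-off concrete, here is a rough Python sketch of the two timestamp encodings. It is not the actual SpamAssassin code: the 6-hour granularity, the little-endian byte order and the function names are all assumptions chosen for illustration.

```python
import struct

EPOCH_2000 = 946684800      # 2000-01-01 00:00:00 UTC as a time_t
GRANULARITY = 6 * 3600      # assumed value for "several hours"


def coarse_pack(t: int) -> bytes:
    """Old-style scheme: 2 bytes holding GRANULARITY-sized steps since 2000."""
    units = (t - EPOCH_2000) // GRANULARITY
    if not 0 <= units <= 0xFFFF:
        raise ValueError("timestamp outside the 2-byte range")
    return struct.pack("<H", units)


def coarse_unpack(data: bytes) -> int:
    """Recover the start of the bucket; the offset inside it is lost."""
    (units,) = struct.unpack("<H", data)
    return EPOCH_2000 + units * GRANULARITY


def exact_pack(t: int) -> bytes:
    """Newer-style scheme: the full time_t as a 4-byte unsigned integer."""
    return struct.pack("<I", t)


if __name__ == "__main__":
    t1 = EPOCH_2000 + 10 * GRANULARITY + 60   # one minute into a bucket
    t2 = t1 + 3600                            # an hour later, same bucket
    print(coarse_pack(t1) == coarse_pack(t2))  # both land in the same bucket
    print(exact_pack(t1) == exact_pack(t2))    # full time_t keeps them apart
```

Two tokens touched an hour apart get the same coarse stamp, so an expiry run cannot tell which is older. On a busy site that is the precision problem described above; the 4-byte form costs 2 more bytes per token and avoids it.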
One good way we did find to cut down hapax db bloat, though, was to use a
polymorphic format for the tokens in the db: if a token has spamcount < 8
and hamcount < 8, it's marshalled so that the spamcount and hamcount are
both shoved into 1 byte as 3-bit fields, with the high bits set to flag
the format.
Here's the perl code in question:
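sub tok_pack {
    my ($self, $ts, $th, $atime) = @_;
    $ts ||= 0; $th ||= 0; $atime ||= 0;
    if ($ts < 8 && $th < 8) {
        # both counts fit in 3 bits: 1 flag/count byte (C) + 32-bit atime (V)
        return pack ("CV", ONE_BYTE_FORMAT | ($ts << 3) | $th, $atime);
    } else {
        # otherwise: 1 format byte, then both counts and atime as 32-bit LE
        return pack ("CVVV", TWO_LONGS_FORMAT, $ts, $th, $atime);
    }
}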
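For anyone who would rather read this in Python, a rough equivalent using the `struct` module might look like the sketch below. The values given for `ONE_BYTE_FORMAT` and `TWO_LONGS_FORMAT` are assumptions (the constants are not shown above; 0xC0 is picked only because it matches "the high bits set"), and `tok_unpack` is a hypothetical inverse added for illustration.

```python
import struct

# Assumed values: the Perl constants are not defined in the snippet above.
ONE_BYTE_FORMAT = 0xC0    # top two bits set marks the compact layout
TWO_LONGS_FORMAT = 0x00
FORMAT_MASK = 0xC0


def tok_pack(ts: int = 0, th: int = 0, atime: int = 0) -> bytes:
    """Pack spam count, ham count and access time, like the Perl tok_pack."""
    if ts < 8 and th < 8:
        # Perl "CV": one unsigned byte, then a 32-bit little-endian integer.
        return struct.pack("<BI", ONE_BYTE_FORMAT | (ts << 3) | th, atime)
    # Perl "CVVV": format byte, then three 32-bit little-endian integers.
    return struct.pack("<BIII", TWO_LONGS_FORMAT, ts, th, atime)


def tok_unpack(data: bytes) -> tuple:
    """Hypothetical inverse: return (ts, th, atime) from either layout."""
    flags = data[0]
    if (flags & FORMAT_MASK) == ONE_BYTE_FORMAT:
        (atime,) = struct.unpack_from("<I", data, 1)
        return (flags >> 3) & 0x07, flags & 0x07, atime
    _, ts, th, atime = struct.unpack("<BIII", data)
    return ts, th, atime


if __name__ == "__main__":
    small = tok_pack(1, 0, 1071687542)    # a hapax: seen once in spam
    large = tok_pack(120, 9, 1071687542)
    print(len(small), len(large))
    print(tok_unpack(small), tok_unpack(large))
```

With these layouts a low-count token takes 5 bytes instead of 13, which is where the saving on hapaxes comes from.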
I do like Bill Y's "sunspots expiry" scheme though ;)
- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Exmh CVS
iD8DBQE/4Kd2QTcbUG5Y7woRAh/DAKC6MGlXpd1bEeR2/BzTmhtH71075ACgg21j
pJ85tiGe697R3s90bP/LRS4=
=slib
-----END PGP SIGNATURE-----