[Spambayes] Results of playing with CDB
Tim Peters
tim.one@comcast.net
Thu, 12 Sep 2002 11:14:45 -0400
[Neale Pickett]
> ...
> So far, it outputs a *giant* text file which can be converted to cdb
> with djb's cdbmake (much faster than doing it in Python). My text file
> weighs in at a ridiculous 106 million bytes. Maybe something changed
> with the tokenizer to make it so much larger, or maybe it's that I'm
> using a derivative of shelve.Shelf in cdbhammie.
As of last night's tokenizer changes, a binary pickle of the dict database
trained on 4000 ham + ~2750 spam shrank to about 7MB (down from about 9MB
before the changes -- I dropped character 5-gram'ming of long words with
high-bit chars, replaced it with a much simpler gimmick, and a full test run
said it made no difference).
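For readers who haven't seen the technique: character 5-gram'ming slides a
5-character window across a word and emits each window as a separate token.
A rough sketch of the idea (not the tokenizer's actual code):

```python
def char_ngrams(word, n=5):
    """Split a word into overlapping character n-grams."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("abcdefg"))
# ['abcde', 'bcdef', 'cdefg']
```

Each n-gram becomes a token in its own right, which is why the scheme can
inflate the database: one long word contributes many keys.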
> I suspect that the 106MB file is indicative of something really wrong
> with what I'm doing, since the same dataset only made me a 20MB file
> with a Berkeley hash a few days ago,
Note that database size has no meaning without at least knowing how many
messages the database was trained on, or, better, how many keys it contains.
Binary pickles are really quite space-efficient.
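The difference is easy to see on a toy dict. A quick sketch comparing a text
pickle to a binary pickle (the word->counts mapping here is a stand-in, not
the real Spambayes database schema):

```python
import pickle

# Hypothetical word -> (spamcount, hamcount) mapping standing in for a
# trained database; the real dict stores richer per-word info.
db = {"word%d" % i: (i % 7, i % 5) for i in range(10000)}

text = pickle.dumps(db, 0)    # protocol 0: ASCII text pickle
binary = pickle.dumps(db, 1)  # protocol 1: binary pickle

print(len(text), len(binary))  # the binary form is noticeably smaller
```

Dumping a flat text representation of every key by hand will generally be far
larger still, since it forgoes the pickle machinery's compact encodings.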
> ...
> It scores a message in under 1s, which I suppose is pretty impressive
> considering the immensity of the database it's working with (111MB). So
> it may well outperform Berkeley hashes with the same data.
I'm curious why you resist a server-like solution.  The plain dict
version of this code scores about 80 msgs/sec on my 866MHz home box, and
that includes the time to open and read each msg from a separate file.
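The appeal of a server is that the big database is loaded into memory once,
at startup, and every subsequent message pays only the lookup cost. A minimal
sketch of the shape (the scorer below is a toy word-probability average, not
the actual Spambayes combining scheme, and the names are made up):

```python
import socketserver

# Stand-in for the trained word -> spam-probability dict, loaded once
# when the server process starts instead of once per message.
DB = {"viagra": 0.99, "meeting": 0.01}

def score(text):
    # Toy scorer: average the probabilities of known words.  Unknown
    # messages fall back to a neutral 0.5.
    probs = [DB[w] for w in text.split() if w in DB]
    return sum(probs) / len(probs) if probs else 0.5

class ScoreHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read one message per connection and reply with its score.
        msg = self.rfile.read().decode("utf-8", "replace")
        self.wfile.write(("%.3f\n" % score(msg)).encode())

# server = socketserver.TCPServer(("localhost", 8025), ScoreHandler)
# server.serve_forever()
```

With the dict resident in memory, per-message cost is dominated by tokenizing
and lookups rather than by opening and parsing a database file each time.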