[Spambayes] Results of playing with CDB

Neale Pickett neale@woozle.org
12 Sep 2002 00:26:21 -0700


In keeping with my tradition of doing precisely what the goal of this
project isn't, I've been tinkering with CDB as a storage back-end.  The
hype about CDB is that it's incredibly fast to look stuff up, but not so
fast at making updates.  In fact, you have to rewrite the entire
database if you update anything.

So what I did was write a front-end that caches writes, then writes
out a whole new database on destruction.  I wrapped it around Neil's
cdb.py with cdbwrap.py (now in CVS).  Then I wrote cdbhammie.py to
use it.
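
The idea is simple enough to sketch in a few lines.  The names below
are made up for illustration, not the actual cdbwrap.py interface:
reads fall through to the read-only cdb, writes pile up in a dict,
and the whole database gets rewritten exactly once at close time
(cdbwrap does the equivalent from its destructor):

    class CacheShelf:
        def __init__(self, db, rewrite):
            self.db = db            # read-only cdb mapping (e.g. cdb.py)
            self.rewrite = rewrite  # callable that rebuilds the database
            self.cache = {}         # pending writes

        def __getitem__(self, key):
            try:
                return self.cache[key]    # an unsaved write wins
            except KeyError:
                return self.db[key]       # fast cdb lookup

        def __setitem__(self, key, value):
            self.cache[key] = value       # defer the expensive rewrite

        def close(self):
            merged = dict(self.db)        # assumes db iterates like a dict
            merged.update(self.cache)
            self.rewrite(merged)          # the one full rewrite

    # Stand-in usage, with plain dicts in place of a real cdb:
    store = {}
    shelf = CacheShelf({"spam": "1"}, store.update)
    shelf["ham"] = "2"
    shelf.close()                         # store now holds both keys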

So far, it outputs a *giant* text file which can be converted to a
cdb with djb's cdbmake (much faster than doing it in Python).  My
text file weighs in at a ridiculous 106 million bytes.  Maybe
something changed in the tokenizer to make it so much larger, or
maybe it's that I'm using a derivative of shelve.Shelf in cdbhammie.
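
For reference, cdbmake reads one "+klen,dlen:key->data" record per
line, terminated by a single blank line, so dumping the dictionary
is nothing fancy.  A sketch (not the code cdbhammie.py actually
uses):

    import sys

    def cdbmake_dump(items, out):
        # Emit djb's cdbmake input format: +klen,dlen:key->data
        for key, data in items:
            out.write("+%d,%d:%s->%s\n"
                      % (len(key), len(data), key, data))
        out.write("\n")    # the blank line ends the input

    cdbmake_dump([("spam", "228"), ("ham", "197")], sys.stdout)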

I suspect that the 106MB file is indicative of something really wrong
with what I'm doing, since the same dataset only made me a 20MB file
with a Berkeley hash a few days ago, and CDB appears to be a very
lightweight format.  Even so, cdbhammie scores a message in under a
second, which I suppose is pretty impressive considering the immensity
of the database it's working with (111MB).  So it may well outperform
Berkeley hashes on the same data.
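
If the Shelf ancestry is the culprit, the bloat would be easy to
explain: a Shelf pickles every value it stores, and the pickle of
even a tiny per-token record is several times the size of a flat
string encoding.  A quick check (the counts here are made up):

    import pickle

    value = {"spam": 228, "ham": 197}   # hypothetical per-token counts
    print(len(pickle.dumps(value)))     # pickled: dozens of bytes
    print(len("228,197"))               # flat encoding: 7 bytes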

Anyhow, I'm out of hours in the day to mess with it.  If anyone else
wants to poke around at it, be my guest.

Neale