key/value store optimized for disk storage

Steve Howell showell30 at yahoo.com
Thu May 3 00:08:35 EDT 2012


On May 2, 8:29 pm, Paul Rubin <no.em... at nospam.invalid> wrote:
> Steve Howell <showel... at yahoo.com> writes:
> > Thanks.  That's definitely in the spirit of what I'm looking for,
> > although the non-64 bit version is obviously geared toward a slightly
> > smaller data set.  My reading of cdb is that it has essentially 64k
> > hash buckets, so for 3 million keys, you're still scanning through an
> > average of 45 records per read, which is about 90k of data for my
> > record size.  That seems actually inferior to a btree-based file
> > system, unless I'm missing something.
>
> 1) presumably you can use more buckets in a 64 bit version; 2) scanning
> 90k probably still takes far less time than a disk seek, even a "seek"
> (several microseconds in practice) with a solid state disk.
>

Doesn't cdb do at least one disk seek as well?  In the diagram on this
page, it looks like you still need to seek to the hash table pointed at
by the initial pointer (one of the 256 entries in the header):

http://cr.yp.to/cdb/cdb.txt
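
For concreteness, here is a rough sketch of the lookup path as I read
that spec.  The function names and struct handling are mine (not taken
from cdb.py or Bernstein's code), so treat it as illustrative only; the
point is that even with the 2k header cached, a hit costs at least two
more seeks, one into the chosen hash table and one to the record itself:

import struct

def cdb_hash(key):
    # djb hash from the spec: h = ((h << 5) + h) ^ c, starting at 5381
    h = 5381
    for c in key:                    # key is bytes
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

def cdb_get(f, key):
    h = cdb_hash(key)
    # Seek 1: the header is 256 (pos, slots) pairs; the low byte of the
    # hash picks which of the 256 hash tables to consult.
    f.seek((h & 255) * 8)
    table_pos, slots = struct.unpack('<LL', f.read(8))
    if slots == 0:
        return None
    # Seek 2 (and onward): probe the chosen table linearly.
    slot = (h >> 8) % slots
    for _ in range(slots):
        f.seek(table_pos + slot * 8)
        slot_hash, record_pos = struct.unpack('<LL', f.read(8))
        if record_pos == 0:          # empty slot: key is absent
            return None
        if slot_hash == h:
            # Seek 3: read the record and confirm the key really matches.
            f.seek(record_pos)
            klen, dlen = struct.unpack('<LL', f.read(8))
            if f.read(klen) == key:
                return f.read(dlen)
        slot = (slot + 1) % slots
    return None

# e.g.  with open('data.cdb', 'rb') as f: print(cdb_get(f, b'some key'))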

> > http://thomas.mangin.com/data/source/cdb.py
> > Unfortunately, it looks like you have to first build the whole thing
> > in memory.
>
> It's probably fixable, but I'd guess you could just use Bernstein's
> cdbdump program instead.
>
> Alternatively maybe you could use one of the *dbm libraries,
> which burn a little more disk space, but support online update.

Yup, I don't think I want to incur the extra overhead.  Do you have
any first-hand experience pushing dbm to the 6 GB scale?  My take on
dbm is that its niche is more in the 10,000-record range.
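
If nobody has numbers, a crude way to find out would be something like
the untested sketch below, using the stdlib dbm module in Python 3
(anydbm on Python 2); the file name, key format, and 2k record size are
just placeholders for my data:

import dbm, glob, os, random, time

value = b'x' * 2048                       # roughly my record size
db = dbm.open('scale_test', 'n')          # 'n' creates a fresh database
for i in range(3000000):                  # ~6 GB of payload
    db[('key-%d' % i).encode()] = value
db.close()

# The backend picked (gdbm, ndbm, dumbdbm) may add a suffix to the name.
size = sum(os.path.getsize(p) for p in glob.glob('scale_test*'))
print('on-disk size:', size)

# Spot-check random lookups once the file is much bigger than RAM.
db = dbm.open('scale_test', 'r')
start = time.time()
for _ in range(1000):
    db[('key-%d' % random.randrange(3000000)).encode()]
print('avg lookup: %.6f s' % ((time.time() - start) / 1000))
db.close()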


