key/value store optimized for disk storage

Paul Rubin no.email at nospam.invalid
Fri May 4 00:38:03 EDT 2012


Steve Howell <showell30 at yahoo.com> writes:
> My test was to write roughly 4GB of data, with 2 million keys of 2k
> bytes each.

If the records are something like English text, you can compress
them with zlib and get a worthwhile gain by pre-initializing the
compressor's dictionary from a fixed English corpus, then cloning
the primed compressor for each record.  That is, if your messages
are a couple of paragraphs each, you might say something like:

  import zlib

  # some fixed 20k or so of records concatenated together
  iv = b"..."

  # Prime a compressor with the corpus and flush to a byte
  # boundary; the corpus text stays in the 32 KB history window.
  primed = zlib.compressobj()
  prefix = primed.compress(iv) + primed.flush(zlib.Z_SYNC_FLUSH)

  # Per record: clone the primed compressor (copy() duplicates the
  # window state) and keep only the bytes emitted after the prefix.
  compressor = primed.copy()
  zout = (compressor.compress(your_record)
          + compressor.flush(zlib.Z_SYNC_FLUSH))
  ...

i.e. the part you save in the file is just the bytes the compressor
emits for the record beyond what it already emitted for the corpus.
To decompress, you prime a decompressor the same way, discard the
corpus output, and then feed it the saved record bytes.
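A sketch of the decompression side (assuming the prefix and zout
values from the compression sketch above):

  # Prime a decompressor with the same corpus prefix and throw
  # away the corpus text it reproduces.
  d_primed = zlib.decompressobj()
  d_primed.decompress(prefix)

  # Per record: clone the primed decompressor, then feed it the
  # saved record bytes.
  decompressor = d_primed.copy()
  record = decompressor.decompress(zout)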

It's been a while since I used that trick, but for JSON records of a
few hundred bytes I remember getting around 2:1 compression, while
starting with an unprimed compressor gave almost no compression.
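
(For what it's worth, newer Pythons (3.3+) expose this priming
directly: compressobj() and decompressobj() both accept a zdict
argument, which skips the clone-and-sync dance.  A rough sketch,
reusing iv and your_record from above:

  comp = zlib.compressobj(zdict=iv)
  data = comp.compress(your_record) + comp.flush()

  decomp = zlib.decompressobj(zdict=iv)
  record = decomp.decompress(data) + decomp.flush()

Note that zlib only uses the last 32 KB or so of the dictionary,
since that's the size of its history window.)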


