key/value store optimized for disk storage
Steve Howell
showell30 at yahoo.com
Fri May 4 01:09:04 EDT 2012
On May 3, 9:38 pm, Paul Rubin <no.em... at nospam.invalid> wrote:
> Steve Howell <showel... at yahoo.com> writes:
> > My test was to write roughly 4GB of data, with 2 million keys of 2k
> > bytes each.
>
> If the records are something like english text, you can compress
> them with zlib and get some compression gain by pre-initializing
> a zlib dictionary from a fixed english corpus, then cloning it.
> That is, if your messages are a couple paragraphs, you might
> say something like:
>
> iv = (some fixed 20k or so of records concatenated together)
> compressor = zlib(iv).clone() # I forget what this
> # operation is actually called
>
> # I forget what this is called too, but the idea is you throw
> # away the output of compressing the fixed text, and sync
> # to a byte boundary
> compressor.sync()
>
> zout = compressor.compress(your_record).sync()
> ...
>
> i.e. the part you save in the file is just the difference
> between compress(corpus) and compress(corpus_record). To
> decompress, you initialize a compressor the same way, etc.
>
> It's been a while since I used that trick but for json records of a few
> hundred bytes, I remember getting around 2:1 compression, while starting
> with an unprepared compressor gave almost no compression.
Sounds like a useful technique. The text snippets that I'm
compressing are indeed mostly English words, and 7-bit ascii, so it
would be practical to use a compression library that just uses the
same good-enough encodings every time, so that you don't have to write
the encoding dictionary as part of every small payload.
Sort of as you suggest, you could build a Huffman encoding for a
representative run of data, save that tree off somewhere, and then use
it for all your future encoding/decoding.
Is there a name to describe this technique?
More information about the Python-list
mailing list