key/value store optimized for disk storage

Paul Rubin no.email at nospam.invalid
Fri May 4 02:03:02 EDT 2012


Steve Howell <showell30 at yahoo.com> writes:
> Sounds like a useful technique.  The text snippets that I'm
> compressing are indeed mostly English words, and 7-bit ascii, so it
> would be practical to use a compression library that just uses the
> same good-enough encodings every time, so that you don't have to write
> the encoding dictionary as part of every small payload.

Zlib stays adaptive; the idea is just to start with some ready-made
compression state that reflects the statistics of your data.
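Since Python 3.3 the zlib module exposes this directly through the
`zdict` parameter: both sides agree on a byte string of common
substrings, and matches against it cost almost nothing in the output.
A rough sketch (the dictionary contents here are made up; in practice
you would use strings that actually recur in your payloads):

```python
import zlib

# Hypothetical shared dictionary: substrings common in the payloads.
# Compressor and decompressor must use the identical bytes.
ZDICT = b"the and of to in is that for it with as on be at"

def compress(payload: bytes) -> bytes:
    # A fresh compressor seeded with the shared dictionary.
    c = zlib.compressobj(level=9, zdict=ZDICT)
    return c.compress(payload) + c.flush()

def decompress(blob: bytes) -> bytes:
    # The decompressor needs the same dictionary to reconstruct matches.
    d = zlib.decompressobj(zdict=ZDICT)
    return d.decompress(blob) + d.flush()
```

The dictionary itself is never written into the compressed stream, so
small payloads stay small.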

> Sort of as you suggest, you could build a Huffman encoding for a
> representative run of data, save that tree off somewhere, and then use
> it for all your future encoding/decoding.

Zlib is better than Huffman in my experience, and Python's zlib module
already has the right entry points.  Looking at the docs,
Compress.flush(Z_SYNC_FLUSH) is the important one.  I did something like
this before and it was around 20 lines of code.  I don't have it around
any more but maybe I can write something else like it sometime.
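A sketch of what those ~20 lines might look like: prime a
compressor/decompressor pair with sample text, sync-flush so the state
is byte-aligned, then copy() the primed objects for each payload.
(`SAMPLE` is a stand-in for a representative run of your real data.)

```python
import zlib

# Stand-in for a representative run of real data.
SAMPLE = b"the quick brown fox jumps over the lazy dog " * 20

def make_primed_pair(sample: bytes):
    comp = zlib.compressobj(9)
    # Compress the sample and sync-flush; the flushed bytes are only
    # needed to bring the decompressor to the matching state.
    primed = comp.compress(sample) + comp.flush(zlib.Z_SYNC_FLUSH)
    decomp = zlib.decompressobj()
    decomp.decompress(primed)
    return comp, decomp

BASE_COMP, BASE_DECOMP = make_primed_pair(SAMPLE)

def compress(payload: bytes) -> bytes:
    # Copy the primed state so each payload starts from the same point.
    c = BASE_COMP.copy()
    return c.compress(payload) + c.flush(zlib.Z_SYNC_FLUSH)

def decompress(blob: bytes) -> bytes:
    d = BASE_DECOMP.copy()
    return d.decompress(blob)
```

Each payload then benefits from the sample's statistics without
carrying its own header or dictionary.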

> Is there a name to describe this technique?

Incremental compression maybe?


