pickle->zlib->shelve

Eric Jacobs x at x.x
Wed Oct 20 17:28:47 EDT 1999


Charles G Waldman wrote:
> 
> Thomas Weholt writes:
> 
>  > I was thinking maybe I could use pickle and compress the pickled object
>  > too, before storing it into the database and save some space. Pickled
>  > objects also seem to have lots of repetitive data.
> 
> If you zlib-compress each pickled object, you're going to get some
> savings in space, due to "internal redundancy" (i.e. low entropy) in
> the objects themselves; but it seems to me that there is also
> redundancy *between* the objects; how to exploit this is not at all
> clear to me however.

That's right. .tar.gz is in the general case better than .gz.tar, which
is in essence what compressing pickled objects would be.
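
For concreteness, here's a rough sketch of that per-object approach in
modern Python (the class name and dbm wrapper are mine, not a standard
API): each value is pickled and then zlib-compressed on its own, so
the compressor only ever sees one object's internal redundancy.

import dbm
import pickle
import zlib

class CompressedShelf:
    # Each value is compressed separately -- the ".gz.tar" layout.
    def __init__(self, filename):
        self._db = dbm.open(filename, "c")

    def __setitem__(self, key, obj):
        # Pickle, then compress. Redundancy *between* objects is
        # invisible to the compressor at this granularity.
        self._db[key] = zlib.compress(pickle.dumps(obj))

    def __getitem__(self, key):
        return pickle.loads(zlib.decompress(self._db[key]))

    def close(self):
        self._db.close()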

> It would be nifty if you could dynamically compress/uncompress the
> whole database file, e.g. using some kind of "compressed file system",
> but this presents difficulties - AFAIK there is some limited
> "compressed file system" support for Linux but it's read-only, due to
> the obvious difficulties of positioning and writing into the middle of
> an existing compressed block.  Also "mmap" support is tricky...

There may be something better; see below.
 
> I also want to point out that your database files may not be as big as
> you think they are.  Gdbm files are typically sparse files, that is,
> they don't take up as much space on disk as you would assume from "ls
> -l", because these files contain "holes".

Hmmm.. this is from the gdbm docs:

"Compatibility with standard dbm and ndbm.

GNU dbm files are not sparse. You can copy them with the UNIX cp command
and they will not expand in the copying process."

You might be thinking of dbm. (When they say "expand", I believe they
mean occupy more disk blocks while keeping the same apparent length.)
 
>  > Awaiting flames and harsh words of discouragement,
> 
> Sorry to disappoint you!

Now this isn't a flame, but an idea. What you really need to maintain a
compressed, scalable, fine-grained database is a way to determine not
whether but where to use the compressor. In other words, what you need
is something like .tar.gz.tar -- where the shelving algorithm has a
method to decide where to put that .gz.

The design would have to acknowledge that whenever anything changes
on the left (inner) side of the .gz, the whole section has to be
repickled and recompressed. The more related objects you put under
the compressor, though, the better your chances of lower entropy,
and hence better compression. The shelver itself would make that
call.

The right-side .tar is the part of the archive whose components can be
changed easily; that corresponds well to the dbm-style database. The
left-side .tar would be more like the way pickle recurses down into
sub-objects, forming a packed byte stream. The size of this part
would be a compromise: small, because it has to be rewritten with
every change, but large enough to maintain a useful compression
ratio.
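
To make that concrete, a rough sketch of the grouped layout (again in
modern Python; the grouping function and class name are hypothetical):
related objects are pickled together into one byte stream, compressed
as a unit, and stored under a single dbm key.

import dbm
import pickle
import zlib

class GroupedShelf:
    # Related objects share one compressed pickle -- the ".gz" sits
    # between an outer dbm "tar" and an inner pickled "tar".
    def __init__(self, filename, group_of):
        self._db = dbm.open(filename, "c")
        self._group_of = group_of  # maps an object key to its group key

    def _load_group(self, gkey):
        try:
            return pickle.loads(zlib.decompress(self._db[gkey]))
        except KeyError:
            return {}

    def __setitem__(self, key, obj):
        # The acknowledged cost: any change repickles and
        # recompresses the whole group.
        gkey = self._group_of(key)
        group = self._load_group(gkey)
        group[key] = obj
        self._db[gkey] = zlib.compress(pickle.dumps(group))

    def __getitem__(self, key):
        return self._load_group(self._group_of(key))[key]

    def close(self):
        self._db.close()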

The shelver could have user-configurable parameters that characterize
the expected use of the database. For example, a mostly read-only
database, with infrequent, non-performance-critical writes, would be
well served by having the compressor towards the outside. Conversely,
a dynamic database that is always changing, but where space is less
of an issue, would be better configured with the compressor towards
the inside.
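
In a sketch like the one above, that parameter could simply be the
grouping function (again, purely illustrative):

# Mostly read-only: coarse groups, better ratio, costlier writes.
archive = GroupedShelf("archive.db", group_of=lambda k: k[:1])

# Always changing: one object per group, cheap writes, little gain.
live = GroupedShelf("live.db", group_of=lambda k: k)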



