zlib interface semi-broken

Travis travis+ml-python at subspacefield.org
Tue Feb 10 18:12:15 EST 2009


On Tue, Feb 10, 2009 at 01:36:21PM -0800, Scott David Daniels wrote:
> >A simple way to fix this would be to add a finished attribute to the
> >Decompress object.
> Perhaps you could submit a patch with such a change?

Yes, I will try and get to that this week.
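
For concreteness, here is a minimal sketch of how client code might use
such an attribute.  The name "finished" is just the one proposed above;
nothing like it exists on the Decompress object today:

    import zlib

    data = zlib.compress(b"example payload")

    d = zlib.decompressobj()
    out = d.decompress(data)

    # Today the only hints are d.unused_data and d.unconsumed_tail,
    # neither of which distinguishes "the stream ended exactly here"
    # from "more input is still needed".  With the patch, client code
    # could simply test a flag:
    #
    #     if d.finished:    # hypothetical attribute, not in zlib today
    #         ...           # end of compressed stream reached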

> >However, perhaps this would be a good time to discuss how this library
> >works; it is somewhat awkward and perhaps there are other changes which
> >would make it cleaner.
> Well, it might be improvable, I haven't really looked.  I personally
> would like it and bz2 to get closer to each other in interface, rather
> than to spread out.  So if you are really opening up a can of worms,
> I vote for two cans.

Well, I like this idea; perhaps this is a good time to discuss the
equivalent of some "abstract base classes", or "interfaces", for
compression.

As I see it, the fundamental abstractions are the stream-oriented
de/compression routines.  Given those, one should easily be able to
implement one-shot de/compression of strings.  In fact, that is how
zlib itself is structured: the base functions are the stream-oriented
ones, with a layer of convenience functions on top that do one-shot
compression and decompression.
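
For example, with today's zlib module the stream objects and the
one-shot helpers relate like this:

    import zlib

    payload = b"the quick brown fox jumps over the lazy dog" * 100

    # Stream-oriented primitive: feed data incrementally, flush at the end.
    c = zlib.compressobj()
    compressed = (c.compress(payload[:2000])
                  + c.compress(payload[2000:])
                  + c.flush())

    # One-shot convenience functions built on the same machinery.
    assert zlib.compress(payload)                  # single call, whole buffer
    assert zlib.decompress(compressed) == payload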

After examining the bz2 module, I notice that it has a file-like
class, BZ2File, which is roughly analogous to the gzip module's
GzipFile.  That file interface could form a third API and would
basically conform to what Python expects of files.
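
Roughly (the file names here are just placeholders):

    import bz2, gzip

    # Both modules already offer a file-like object over a compressed file.
    f = bz2.BZ2File("example.bz2", "w")
    f.write(b"hello from bz2")
    f.close()

    g = gzip.GzipFile("example.gz", "w")
    g.write(b"hello from gzip")
    g.close()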

So what I suggest is a common framework of three APIs: a sequential
compression/decompression API for streams, a layer (potentially
generic) on top of it for strings/buffers, and a third API for
file-like access.  Presumably the file-like access can be implemented
on top of the sequential API as well.
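
To make the shape of that concrete, here is a purely illustrative
sketch; none of these class or function names exist anywhere, they are
only meant to show how the three layers could stack:

    class SequentialCompressor(object):
        """Layer 1: the stream-oriented primitive each codec provides."""
        def compress(self, data):
            raise NotImplementedError
        def flush(self):
            raise NotImplementedError

    def compress_bytes(compressor, data):
        """Layer 2: a generic one-shot helper built on the primitive."""
        return compressor.compress(data) + compressor.flush()

    class CompressedFile(object):
        """Layer 3: a file-like wrapper, also built on the primitive."""
        def __init__(self, fileobj, compressor):
            self.fileobj = fileobj
            self.compressor = compressor
        def write(self, data):
            self.fileobj.write(self.compressor.compress(data))
        def close(self):
            self.fileobj.write(self.compressor.flush())
            self.fileobj.close()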

If the sequential de/compression routines are indeed the primitives,
and sufficient to implement the other two APIs, then we have the
option of implementing the "upper" two layers in pure Python,
reducing the amount of extension code that has to be written.  I see
that as desirable, since it gives us choices for how the upper layers
are written: in pure Python, or as extensions to the C code where
available.
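
As a rough proof of concept, the existing compressor objects in zlib
and bz2 are already close enough to a common sequential primitive that
a generic one-shot layer can be written against them today:

    import bz2
    import zlib

    def one_shot_compress(make_compressor, data):
        # Works with any factory returning an object that has
        # compress() and flush(), e.g. zlib.compressobj or
        # bz2.BZ2Compressor.
        c = make_compressor()
        return c.compress(data) + c.flush()

    payload = b"some data" * 50
    assert zlib.decompress(one_shot_compress(zlib.compressobj, payload)) == payload
    assert bz2.decompress(one_shot_compress(bz2.BZ2Compressor, payload)) == payload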

I seem to recall a number of ancillary functions in zlib, such as
those for loading a compression dictionary.  There are also options
such as flushing the compressor so that decompression can
resynchronize should part of the archive become garbled.  Where such
functions are available, they could be exposed, though it would be
desirable to give them the same name in each module so that client
code can test for their existence in a compression-agnostic way.
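
For instance, zlib already exposes flush modes that create such
resynchronization points, and client code could probe for optional
features with plain attribute checks (the probing style below is only
a suggestion):

    import zlib

    c = zlib.compressobj()
    part1 = c.compress(b"first record")
    # Z_FULL_FLUSH emits all pending output and resets the compressor
    # state, giving a point from which a decompressor can restart if
    # earlier data is garbled.
    part1 += c.flush(zlib.Z_FULL_FLUSH)
    part2 = c.compress(b"second record") + c.flush()

    assert zlib.decompress(part1 + part2) == b"first recordsecond record"

    # Feature test in a compression-agnostic way: just look for the
    # agreed-upon name.
    supports_full_flush = hasattr(zlib, "Z_FULL_FLUSH")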

For what it's worth, I would rather see a Pythonic interface to these
libraries than a simple-as-can-be wrapper around the C functions.  I
personally find it annoying to have to drop down to non-OOP styles in
a Python program in order to use a C library.  Whether the OOP layer
is added atop the C library in pure Python or in the C-to-Python
binding is an implementation detail to me, and I suspect to most
Python programmers; they don't care, they just want it to be easy to
use from Python.  If performance turns out to matter, and the
underlying compression library supports an "upper layer" in C, then
we have the option of using that code.

So my suggestion is that we (the Python users) brainstorm about how we
want the API to look, and not focus on the underlying library except
insofar as it informs our discussion of the proper APIs - for example,
features such as flushing state, setting compression levels/windows,
or marking resynchronization points.

My further suggestion is that we start with the sequential
de/compression, since it seems like a fundamental primitive.
De/compressing strings will be trivial, and the file-like interface is
already described by Python.

So my first suggestion on the stream de/compression API thread is:

The sequential de/compression needs to be capable of returning more
than just the de/compressed data.  It should at least be able to
signal end-of-stream conditions, and possibly other states as well.
I see two ways of implementing this (both are sketched after the
list):

1) The de/compression object holds state in various members such as
data input buffers, data output buffers, and flags indicating
conditions such as synchronization points or end-of-stream.  Member
functions are called and primarily manipulate the data members of the
object.

2) The de/compression object has routines for reading de/compressed
data, and conditions such as end-of-stream or resynchronization points
are signalled as exceptions, much as file objects can raise EOFError.
My problem with this is that client code has to be cognizant of the
possible exceptions that might be raised, so one cannot easily add new
exceptions should the need arise.  For example, if we later add an
exception to indicate a possible resynchronization point, existing
client code may not handle it as a non-fatal condition.
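
Purely as illustration, the two styles might look something like the
following; every class and attribute name here is invented for this
discussion:

    class EndOfStream(Exception):
        pass

    class ResyncPoint(Exception):
        pass

    # Option 1: the object exposes its state as members the caller polls.
    class StatefulDecompressor(object):
        def __init__(self):
            self.finished = False
            self.at_resync_point = False
        def decompress(self, chunk):
            # a real implementation would update the flags above
            return b""

    # Option 2: the same conditions surface as exceptions, like EOFError
    # on files.  The drawback: introducing ResyncPoint later breaks
    # callers that only expected EndOfStream.
    class RaisingDecompressor(object):
        def decompress(self, chunk):
            # a real implementation would raise EndOfStream or
            # ResyncPoint as appropriate
            return b""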

Thoughts?
-- 
Crypto ergo sum.  http://www.subspacefield.org/~travis/
Do unto other faiths as you would have them do unto yours.
If you are a spammer, please email john at subspacefield.org to get blacklisted.
