[Python-bugs-list] [Bug #124981] zlib decompress of sync-flushed data fails

Mon, 22 Jan 2001 15:31:45 -0800

Bug #124981, was updated on 2000-Dec-07 23:25
Here is a current snapshot of the bug.

Project: Python
Category: Documentation
Status: Open
Resolution: None
Bug Group: None
Priority: 5
Submitted by: abo
Assigned to : akuchling
Summary: zlib decompress of sync-flushed data fails

Details: I'm not sure if this is just an undocumented limitation or a
genuine bug. I'm using python 1.5.2 on winNT.

A single decompress of a large amount (16K+) of compressed data that has
been sync-flushed fails to produce all the data up to the sync-flush. The
data remains inside the decompressor untill further compressed data or a
final flush is issued. Note that the 'unused_data' attribute does not show
that there is further data in the decompressor to process (it shows ''). 

A workaround is to decompress the data in smaller chunks. Note that
compressing data in smaller chunks is not required, as the problem is in
the decompressor, not the compressor.

The following code demonstrates the problem, and raises an exception when
the compressed data reaches 17K;

from zlib import *
from random import *

# create compressor and decompressor
c=compressobj(9)
d=decompressobj()

# try data sizes of 1-63K
for l in range(1,64):
    # generate random data stream
    a=''
    for i in range(l*1024):
        a=a+chr(randint(0,255))
    # compress, sync-flush, and decompress
    t=d.decompress(c.compress(a)+c.flush(Z_SYNC_FLUSH))
    # if decompressed data is different to input data, barf,
    if len(t) != len(a):
        print len(a),len(t),len(d.unused_data)
        raise error

Follow-Ups:

Date: 2001-Jan-22 15:31
By: abo

Comment:
I've got a patch that I'm currently testing. When it's ready how do you
want me to submit it; through the SF patch manager?

I've analysed the code pretty extensively by now, and I believe I found a
few other potential problems (though I haven't bothered to create test
cases to trigger them). The original problem I found could affect
compression as well as decompression. There were some memory allocations
that didn't raise exceptions if they failed, and potential memory leaks
when exceptions did occur. Also, deflateEnd would be called twice in a
decompression object if it was flushed before it was destroyed.

The strange thing is compression, decompression, and both flush routines
perform basicly the same function, but they were all structured totally
differently (like they had been written by different people). Also, there
didn't seem to be any consistancy in the coding style so I didn't know what
style to use. I tried to resist fixing more than the initial problem I
identified, but in the end couldn't help myself.

As an aside, I noticed that the 2.0 version of zlibmodule.c is Python 2.0
specific (It wouldn't work on my 1.5.2 system so I had to compile 2.0 to
test my patch).

How do you want me to submit my patch; one big patch, or several smaller
ones for each fix? I have been using prcs to track my changes, but I have
been a bit lax in checking in each small change, so creating seperate
patches will take a little work.
-------------------------------------------------------

Date: 2001-Jan-20 06:22
By: abo

Comment:
OK, Christmas is over, and I didn't really get a chance to look at it
much.

However, some preliminary testing of the zlib library suggests that it is
_not_ in zlib. A c program that does the same compress/decompress of random
data using steadily increasing block sizes does not show the same
fault...

I've noticed what might be a potential problem in the inflateInit2 call in
PyZlib_decompressobj. The next_in and avail_in are not being initialized
before it's called. The zlib manual says they should be init'ed because
deflateInit will attempt to determine the compression method if it can. I'm
not sure why this hasn't tripped already... does PyObject_New zero memory
when it allocates things?

I also think I've found the main problem... the PyZlib_objdecompress
routine's main decompression loop thinks decompression is compelete when
all available input has been processed. However, it is possible that all
available input has been processed but there is more pending output. The
code as it stands does correctly allocate more space when this occurs, but
the loop terminates without calling inflate one more time.

I'm putting together a patch tomorrow. Right now I'm going to bed...
-------------------------------------------------------

Date: 2000-Dec-20 15:17
By: abo

Comment:
I have had a look at this in more detail (Python C interfacing is actually
pretty easy :-).

I dunno whether to go into details here, but I have noticed that inflate is
being called with Z_NO_FLUSH when the zlib docs suggest Z_SYNC_FLUSH or
Z_FINISH. However, the zlib implementation only treats Z_FINISH
differently, so this should not make a difference (but it might in future
versions of zlib).

As you have said, .unused_data will only contain data if something is
appended to a complete compressed object in the stream fed to a
decompressor. Perhaps the docs need something to clarify this, perhaps in
the form of an example?

I am going to write some test progs to see if it's in zlib. At this stage I
suspect that it is, and I'm hoping over christmas to get a patch done.

-------------------------------------------------------

Date: 2000-Dec-19 11:48
By: akuchling

Comment:
.unused_data is really a red herring; the PyZlib_objdecompress()
loops until zst->avail_in is zero,  so .unused_data must always be zero by
definition.  (The attribute is there to support gzip-format files that may
contain multiple compressed streams concatenated together.)

I still have no idea what the documentation should say; "don't pass more
than 16K of  compressed data when you're expecting a sync-flush."  I can't
see a way to explain this coherently without a big long explanation that
will confuse people who don't care about this problem.  (Add a special
note, or known bugs subsection, maybe?)

A simple C test program should be written, in order to check if it's 
the zlib library itself that's doing this.    

-------------------------------------------------------

Date: 2000-Dec-18 20:13
By: fdrake

Comment:
Andrew, please summarize what doc changes are needed, or make the changes
(whichever is easier for you is fine).
-------------------------------------------------------

Date: 2000-Dec-12 15:18
By: abo

Comment:
Further comments...

After looking at the C code, a few things became clear; I need to read more
about C/Python interfacing, and the "unused_data" attribute will only
contain data if additional data is fed to a de-compressor at the end of a
complete compressed stream.

The purpose of the "unused_data" attribute is not clear in the
documentation, so that should probably be clarified (mind you, I am looking
at pre-2.0 docs so maybe it already has?).

The failure to produce all data up to a sync-flush is something else... I'm
still looking into it. I'm not sure if it is an inherent limitation of
zlib, something that needs to be fixed in zlib, or something that needs to
be fixed in the python interface. If it is an inherent limitation, I'd like
to characterise it a bit better before documenting it. If it is something
that needs to be fixed in either zlib or the python interface, I'd like to
fix it.

Unfortunately, this is a bit beyond me at the moment, mainly in time, but
also a bit in skill (need to read the python/C interfacing documentation).
Maybe over the christmas holidays I'll get a chance to fix it.

-------------------------------------------------------

Date: 2000-Dec-12 13:32
By: gvanrossum

Comment:
OK, assigned to Fred. You may ask Andrew what to write. :-)
-------------------------------------------------------

Date: 2000-Dec-08 14:50
By: abo

Comment:
I'm not that sure I'm happy with it just being marked closed. AFAIKT, the
implementation definitely doesn't do what the documentation says, so to
save people like me time when they hit it, I'prefer the bug at least be
assigned to documentation so that the limitation is documented.

>From my reading of the documentation as it stands, the fact that there is
more pending data in the decompressor should be indicated by it's
"unused_data" attribute. The tests seem to show that "decompress()" is only
processing 16K of compressed data each call, which would suggest that
"unused_data" should contain the rest. However, in all my tests that
attribute has always been empty. Perhaps the bug is in there somewhere?

Another slight strangeness, even if "unused_data" did contain something,
the only way to get it out is by feeding in more compressed data, or
issuing a flush(), thus ending the decompression...

I guess that since I've been bitten by this, it's up to me to fix it. I've
got the source to 2.0 and I'll have a look and see if I can submit a
patch.

<sigh> and I was coding this app in python to avoid coding in C :-)
-------------------------------------------------------

Date: 2000-Dec-08 09:26
By: akuchling

Comment:
Python 2.0 demonstrates the problem, too.

I'm not sure what this is: a zlibmodule bug/oversight or
simply problems with zlib's API.  Looking at zlib.h, 
it implies that you'd have to call inflate() with the
flush parameter set to Z_SYNC_FLUSH to get the remaining data. 
Unfortunately this doesn't seem to help -- .flush() method doesn't support
an argument, but when I patch zlibmodule.c to allow one,
.flush(Z_SYNC_FLUSH) always fails with a -5: buffer error, perhaps because
it expects there to be some new data.

(The DEFAULTALLOC constant in zlibmodule.c is 16K, but this 
seems to be unrelated to the problem showing up with more than 16K of data,
since changing DEFAULTALLOC to 32K or 1K makes no difference to the size of
data at which the bug shows up.)

In short, I have no idea what's at fault, or if it can or should be fixed. 
Unless you or someone else submits a patch, I'll just leave it alone, and
mark this bug as closed and "Won't fix".

-------------------------------------------------------

Date: 2000-Dec-08 07:44
By: gvanrossum

Comment:
I *think* this may have been fixed in Python 2.0.

I'm assigning this to Andrew who can confirm that and close the bug report
(if it is fixed).
-------------------------------------------------------

Date: 2000-Dec-07 23:28
By: abo

Comment:
Argh... SF killed all my indents... sorry about that. You should be able to
figure it out, but if not email me and I can send a copy.
-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=124981&group_id=5470