[Numpy-discussion] npyio -> gzip 271 Error -3 while decompressing ?

Mon Jun 20 21:06:37 EDT 2011

Moin Denis,

On 20 Jun 2011, at 19:04, denis wrote:
>  a separate question, have you run genfromtxt( "xx.csv.gz" ) lately ?

I haven't, and I was not particularly involved with it before this  
patch, so this would possibly be better addressed to the list.

> On .../scikits.learn-0.8/scikits/learn/datasets/data.digits.csv.gz
> numpy 1.6.0, py 2.6 mac I get
>
>    X = np.genfromtxt( filename, delimiter="," )
>  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/site-packages/numpy/lib/npyio.py", line 1271, in genfromtxt
>    first_line = fhd.next()
>  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/gzip.py", line 438, in next
>    line = self.readline()
>  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/gzip.py", line 393, in readline
>    c = self.read(readsize)
>  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/gzip.py", line 219, in read
>    self._read(readsize)
>  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/
> python2.6/gzip.py", line 271, in _read
>    uncompress = self.decompress.decompress(buf)
> zlib.error: Error -3 while decompressing: invalid distance too far
> back
>
> It would be nice to fix this too, if it hasn't been already.
> Btw the file gunzips fine.

I could reproduce that error for the gzip'ed csv files in that
directory; it can be isolated to the underlying gzip call above:
    fhd = gzip.open('digits.csv.gz', 'rbU'); fhd.next()
produces the same error for these files with all python2.x versions on
my Mac, but not with python3.x. It also occurs only with the 'U' mode specified,
yet the same mode is parsing other .gz files just fine. I could not
really track down what the 'U' flag is doing in gzip.py; in python2's
built-in open() it selects universal-newline mode, so I assume it is
simply handed through to the underlying file object. It's also a mystery
to me what is different in those files that would trigger the error. I
even read them in with loadtxt() and wrote them back using constant  
line width and/or spaces as separators, still producing the same  
exception.
The obvious place to fix this (or to work around a bug in python2's
gzip.py, whatever) would be to change the open command in genfromtxt,
    fhd = iter(np.lib._datasource.open(fname, 'rbU'))
to omit the 'U', at least with python2. Alternatively one could do a
test read and catch the exception, then re-open the file with mode
'rb'...
Pierre, if you are reading this, can you comment on how important the
'U' is, for performance considerations or such?
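
For illustration, the test-read fallback mentioned above could look
roughly like the following. This is only a sketch, not what genfromtxt
does: open_gz_with_fallback is a hypothetical helper name, it calls
gzip.open directly instead of np.lib._datasource.open, and the set of
exceptions caught is an assumption (zlib.error as in the traceback, plus
IOError and ValueError, since as far as I know later python3 releases
reject the 'U' flag in gzip with a ValueError).

```python
import gzip
import zlib


def open_gz_with_fallback(path, modes=("rbU", "rb")):
    """Try each mode in turn; return (mode, first_line, handle).

    Hypothetical helper: a test read of the first line is done for each
    mode, and the next mode is tried if opening or reading fails.
    """
    last_error = None
    for mode in modes:
        fh = None
        try:
            fh = gzip.open(path, mode)
            # Test read: the traceback above surfaced on the first line.
            first_line = fh.readline()
            return mode, first_line, fh
        except (zlib.error, IOError, ValueError) as err:
            # Assumed failure types; remember the error and try the next mode.
            last_error = err
            if fh is not None:
                fh.close()
    if last_error is None:
        raise ValueError("no modes given")
    raise last_error
```

The handle comes back positioned after the first line, so the caller
would have to use the returned first_line rather than read it again.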

HTH,
								Derek