[Numpy-discussion] genfromtxt and gzip

Tue Jun 11 19:43:38 EDT 2013

On 05.06.2013, at 9:52AM, Ted To <rainexpected at theo.to> wrote:

> From the list archives (2011), I noticed that there is a bug in the
> python gzip module that causes genfromtxt to fail with python 2 but this
> bug is not a problem for python 3.  When I tried to use genfromtxt and
> python 3 with a gzip'ed csv file, I instead got:
> 
> IOError: Mode rbU not supported
> 
> Is this a bug?  I am using python 3.2.3 and numpy 1.7.1 from the
> experimental Debian repository.

Interesting, it used to be the other way round: Python 3's gzip module, at least,
was believed to work with 'U' mode (universal newline conversion).
The Python 2 bug was apparently fixed in Python 2.7.3:
http://bugs.python.org/issue5148

but from the closing comment I take it that 'U' should indeed _not_ be used in Python 3:

"The data corruption issue is now fixed in the 2.7 branch.

In 3.x, using a mode containing 'U' results in an exception rather than silent data corruption. 
Additionally, gzip.open() has supported text modes ("rt"/"wt"/"at") and newline translation since 3.3"
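
For illustration, the text mode mentioned in that comment could be used roughly as follows on Python 3.3 or later. This is only a minimal sketch: the helper name read_gzip_text and the default encoding are my own assumptions, not anything numpy provides.

```python
import gzip

def read_gzip_text(path, encoding="ascii"):
    """Return the decoded lines of a gzip'ed text file (hypothetical helper).

    Requires Python >= 3.3, where gzip.open() accepts the text mode "rt".
    """
    # newline=None asks for universal newline translation:
    # '\r\n' and '\r' are both delivered as '\n'.
    with gzip.open(path, "rt", encoding=encoding, newline=None) as fh:
        return fh.readlines()
```

Note that this yields str lines rather than bytes, so a caller that expects bytes would still have to encode them.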

Checking the various Python versions on OS X 10.8 I found:

2.6.8: fails similarly to the older 2.x versions, i.e. gzip opens with 'rbU' but then fails upon reading
(possibly randomly) with
/sw/lib/python2.6/gzip.pyc in _read_eof(self)
    302         if crc32 != self.crc:
    303             raise IOError("CRC check failed %s != %s" % (hex(crc32),
--> 304                                                          hex(self.crc)))

2.7.5: works, as expected given the resolution of issue 5148 above.

3.1.5: works as well, which could just mean that the exception mentioned above has not
made it into the 3.1.x branch…

3.2.5 and 3.3.2: gzip.open raises the exception as documented.
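
A check along these lines can be repeated on any interpreter with a small script. The following is a sketch only; the function names probe_gzip_mode and make_sample and the sample data are made up for illustration, and the exception type reported will differ between Python versions.

```python
import gzip
import os
import sys
import tempfile

def probe_gzip_mode(path, mode="rbU"):
    """Report whether gzip.open accepts `mode` and can read `path` completely."""
    try:
        fh = gzip.open(path, mode)
        try:
            fh.read()  # the 2.6 failure only shows up while reading
        finally:
            fh.close()
    except (IOError, ValueError) as exc:
        return "rejected: %s: %s" % (type(exc).__name__, exc)
    return "accepted"

def make_sample(data=b"1,2,3\r\n4,5,6\r\n"):
    """Write a small gzip'ed CSV to a temporary directory and return its path."""
    path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
    fh = gzip.open(path, "wb")
    try:
        fh.write(data)
    finally:
        fh.close()
    return path

if __name__ == "__main__":
    print("%s: %s" % (sys.version.split()[0], probe_gzip_mode(make_sample())))
```

Since the 2.6 failure was possibly random, a single "accepted" on that version would not prove much.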

So it looks like the original issue is still not resolved: the universal newline flag should not be
passed to gzip.open (where it is meaningless or even harmful); ideally the 'U' flag
should probably be caught in _datasource.py.
I take it from the comments on issue 5148 that 3.3's gzip module offers alternative methods to
do the newline conversion, but for 3.1, 3.2 and 2.6 this might still have to be done within either
_datasource.py or genfromtxt; however I have no idea if anyone has come up with a good
solution for this by now…
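
Just to sketch what catching the flag might look like (this is not the actual _datasource.py code, and open_gzip_universal is a hypothetical name): strip 'U' from the mode before calling gzip.open and, if it was requested, do the newline translation in an io.TextIOWrapper instead. I am assuming read-only access and a Python 3 where GzipFile can be wrapped this way; the latin-1 default is likewise an assumption.

```python
import gzip
import io

def open_gzip_universal(path, mode="rb", encoding="latin-1"):
    """Open a gzip file for reading without ever passing 'U' to gzip.open.

    Hypothetical helper: if 'U' is present in `mode`, newline translation
    is done by a TextIOWrapper around the binary gzip stream.
    """
    universal = "U" in mode
    gz_mode = mode.replace("U", "")
    if "b" not in gz_mode:
        gz_mode += "b"  # always open the gzip stream itself in binary mode
    raw = gzip.open(path, gz_mode)
    if not universal:
        return raw  # plain binary stream, bytes in, bytes out
    # newline=None turns '\r\n' and '\r' into '\n' while reading.
    return io.TextIOWrapper(raw, encoding=encoding, newline=None)
```

The catch is that the wrapped stream yields str rather than bytes, so whatever consumes it (genfromtxt, say) would have to cope with that or re-encode the lines.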

Cheers,
						Derek