[Numpy-discussion] genfromtxt universal newline support

Derek Homeier derek at astro.physik.uni-goettingen.de
Mon Jun 30 04:31:03 EDT 2014


Hi all,

I was just taking a fresh look at the mess that is, imo, the support for automatic
line ending recognition in genfromtxt and, more generally, in the Python file openers.
I am glad that at least reading gzip files is no longer entirely broken in Python 3, but
detecting “old Mac” style CR line endings in particular currently only works
for uncompressed and bzip2 files under 2.6/2.7.
This is largely because genfromtxt wants to open everything in binary mode,
which arguably makes no sense for ASCII text files with numbers. I think the
only reason this works in 2.x is that the ‘U’ reading mode overrides the ‘b’.

So on the Python side what actually works for automatic line ending detection is:

Python          2.6   2.7   3.2   3.3/3.4
uncompressed:   U     U     t     t
gzip:           E     N     E     t
bzip2:          U     U     E     t*
lzma:           -     -     -     t*

U - works with mode ‘rU’
E - mode ‘rU’ raises an error
N - mode ‘rU’ is accepted, but does not detect CR (‘\r’) line endings
	(actually I think ‘U’ is simply internally discarded by gzip.open() in 2.7.4+)
t - works with mode ‘rt’ (default with plain open())
* - requires the module-level '.open()' function rather than the '.XXXFile()' class of bz2/lzma
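
To make the 3.3/3.4 column concrete, here is a minimal sketch of what ‘rt’ mode gives you with the stock openers. This is not the code from my commit; open_text() and read_lines() are hypothetical helper names, choosing the opener by file extension is an assumption made for brevity, and the Latin-1 default encoding is likewise just an assumption for the sketch.

```python
import bz2
import gzip
import lzma
import os

# Hypothetical mapping, for illustration only: choose an opener by extension.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_text(path, encoding="latin1"):
    """Open `path` in 'rt' mode (Python 3.3+) with universal newlines."""
    ext = os.path.splitext(path)[1]
    opener = _OPENERS.get(ext, open)
    # newline=None (the default) translates '\r\n' and '\r' to '\n' on read.
    return opener(path, mode="rt", encoding=encoding, newline=None)


def read_lines(path):
    """Return the lines of `path` as str, without the trailing newline."""
    with open_text(path) as f:
        return [line.rstrip("\n") for line in f]
```

With this, a file containing b"1 2\r3 4\r" should come back as two lines whether it is stored plain, gzipped, bzipped or xz-compressed. Note that this relies on the module-level open() functions; the BZ2File/LZMAFile classes only offer binary modes.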

Therefore I’d propose the changes in
https://github.com/dhomeier/numpy/commit/995ec93

to extend universal newline recognition as far as possible with the above openers.
There are some potential issues with this:

1. Switching to ‘rt’ mode for Python 3.x means that np.lib._datasource.open() no longer
returns byte strings by itself, so genfromtxt has to use asbytes() on the returned lines.
Since this occurs in only two places, I don’t see a major problem with this.
2. In the tests I had to work around the lack of fileobj support in bz2.BZ2File by using
os.system(‘bzip2 …’) on the temporary file, which might not work on all systems.
In particular I’d expect it to fail under Windows, but it’s not clear to me how well the whole
mkstemp approach works under Windows anyway...
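
The two points above can be sketched roughly as follows. This is an illustration under assumptions, not what the commit does: to_bytes() is a stand-in for asbytes() (I am assuming Latin-1 encoding of str input), and make_bz2_tempfile() is a hypothetical helper showing one possible way to produce a bzip2 test file without calling the external bzip2 binary, by compressing in memory with bz2.compress() and writing the result through the descriptor returned by mkstemp.

```python
import bz2
import os
import tempfile


def to_bytes(line):
    """Stand-in for asbytes(): pass bytes through, encode str as Latin-1."""
    if isinstance(line, bytes):
        return line
    return line.encode("latin1")


def make_bz2_tempfile(data):
    """Write bz2-compressed `data` to a temporary file and return its name."""
    fd, name = tempfile.mkstemp(suffix=".bz2")
    # os.fdopen() takes ownership of the descriptor and closes it on exit.
    with os.fdopen(fd, "wb") as f:
        f.write(bz2.compress(data))
    return name


name = make_bz2_tempfile(b"1 2\r3 4\r")
try:
    # 'rt' mode needs bz2.open() (Python 3.3+); CR endings become '\n'.
    with bz2.open(name, "rt") as f:
        lines = [to_bytes(line) for line in f]
finally:
    os.remove(name)

print(lines)  # [b'1 2\n', b'3 4\n']
```

Whether this sidesteps the Windows concern is untested on my side; it only avoids the dependency on a bzip2 executable being on the path.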

As a final note, http://bugs.python.org/issue13989#msg153127 suggests a workaround
that might make this work with gzip.open() (and perhaps bz2?) on 3.2 as well.
I am not sure how high a priority 3.2 support is for the near future; for the moment I am not
strongly inclined to implement it…

Grateful for comments or tests (especially under Windows!) of the commit(s) above -

						Derek



