genfromtxt universal newline support
Hi all,

I was just having a new look into the mess that is, imo, the support for automatic line-ending recognition in genfromtxt and, more generally, the Python file openers. I am glad that at least reading gzip files is no longer entirely broken in Python 3, but actually detecting in particular "old Mac" style CR line endings currently only works for uncompressed and bzip2 files under 2.6/2.7. This is largely because genfromtxt wants to open everything in binary mode, which arguably makes no sense for ASCII text files with numbers. I think the only reason this works in 2.x at all is that the 'U' reading mode overrides the 'b'.

So on the Python side, what actually works for automatic line-ending detection is:

               Python 2.6   2.7   3.2   3.3/3.4
uncompressed:         U     U     t     t
gzip:                 E     N     E     t
bzip2:                U     U     E     t*
lzma:                 -     -     -     t*

U - works with mode 'rU'
E - mode 'rU' raises an error
N - mode 'rU' is accepted, but does not detect CR ('\r') line endings (actually I think 'U' is simply discarded internally by gzip.open() in 2.7.4+)
t - works with mode 'rt' (the default with plain open())
* - requires the '.open()' rather than the '.XXXFile()' method of bz2/lzma

Therefore I'd propose the changes in https://github.com/dhomeier/numpy/commit/995ec93 to extend universal newline recognition as far as possible with the above openers. There are some potential issues with this:

1. Switching to 'rt' mode for Python 3.x means that np.lib._datasource.open() does not return byte strings by itself, so genfromtxt has to use asbytes() on the returned lines. Since this occurs in only two places, I don't see a major problem with it.

2. In the tests I had to work around the lack of fileobj support in bz2.BZ2File by using os.system('bzip2 …') on the temporary file, which might not work on all systems. In particular I'd expect it to fail under Windows, but it's not clear to me how well the entire mkstemp approach works under Windows...

As a final note, http://bugs.python.org/issue13989#msg153127 suggests a workaround that might make this work with gzip.open() (and perhaps bz2?) on 3.2 as well. I am not sure how high 3.2 support is ranking for the near future; for the moment I am not strongly inclined to implement it…

Grateful for comments or tests (especially under Windows!) of the commit(s) above.

- Derek
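For illustration, here is a minimal sketch of the mode selection summarised in the table above. This is not the code from the linked commit; the helper name is made up, and it only covers plain, gzip and bzip2 files on Python 2.6/2.7 and 3.3+:

```python
import sys
import gzip
import bz2

def open_text_universal(path):
    """Open a plain, gzip- or bzip2-compressed text file so that LF, CRLF
    and CR line endings are all recognised, as far as the stdlib allows.

    Hypothetical helper illustrating the mode choices in the table above;
    not the code from the linked commit.
    """
    if sys.version_info >= (3, 3):
        # The 3.3+ compression modules accept text mode ('rt'), which
        # enables universal newline translation for all three cases.
        if path.endswith('.gz'):
            return gzip.open(path, 'rt')
        if path.endswith('.bz2'):
            return bz2.open(path, 'rt')   # bz2.open() is new in 3.3
        return open(path, 'rt')
    # On 2.6/2.7 only plain open() and BZ2File honour 'rU'; gzip either
    # rejects the flag or silently ignores it, so CR-only files stay broken.
    if path.endswith('.gz'):
        return gzip.open(path, 'rb')
    if path.endswith('.bz2'):
        return bz2.BZ2File(path, 'rU')
    return open(path, 'rU')
```

In text mode the lines come back as str rather than bytes, which is where the asbytes() calls mentioned in point 1 come in.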
genfromtxt and loadtxt need an almost full rewrite to fix the botched Python 3 conversion of these functions. There are a couple of threads about this on this list already. There are numerous PRs fixing stuff in these functions, all of which I have currently -1'd because we need to fix the underlying unicode issues first. I have a PR where I started this for loadtxt, but it is incredibly annoying to try to support all the broken use cases the function accidentally supported.

The 1.9 beta still uses the broken functions because I had no time to get this done correctly. But we should probably put a big fat future warning into the release notes that genfromtxt and loadtxt may stop working for your binary streams. That will probably allow us to start fixing these functions.
On Mon, Jun 30, 2014 at 12:33 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
+1 to doing the proper fix instead of piling up buggy hacks. Do we understand the difference between the current code and the "proper" code well enough to detect cases where they differ and issue warnings in those cases specifically? -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
On 30 Jun 2014, at 04:39 pm, Nathaniel Smith <njs@pobox.com> wrote:
What binary streams?
Does it make sense to keep maintaining both functions at all? IIRC the idea that loadtxt would be the faster version of the two was discarded long ago, so it seems there is very little, if anything, that loadtxt can do that cannot be done just as well by genfromtxt. The main compatibility issue is probably the different default behaviour and interface of the two, but perhaps that might be best solved by replacing loadtxt with another genfromtxt wrapper?

A real need, which has also been discussed at length, is a truly performant text IO function (i.e. one using a compiled ASCII number parser, and optimally also a more memory-efficient one), but unfortunately all the people interested in implementing this seem to have drifted away (not excluding myself from this)…

Cheers, Derek
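As a rough illustration of the wrapper idea above — purely a hypothetical sketch, mapping only a few of loadtxt's keywords onto genfromtxt, with the ndmin/unpack handling simplified:

```python
import numpy as np

def loadtxt_compat(fname, dtype=float, comments='#', delimiter=None,
                   skiprows=0, usecols=None, unpack=False, ndmin=0):
    """Hypothetical loadtxt replacement that delegates to genfromtxt
    with loadtxt-like defaults.  Only a subset of keywords is mapped."""
    arr = np.genfromtxt(fname, dtype=dtype, comments=comments,
                        delimiter=delimiter, skip_header=skiprows,
                        usecols=usecols)
    # loadtxt's ndmin/unpack post-processing is only crudely approximated here
    while arr.ndim < ndmin:
        arr = arr[np.newaxis, ...]
    return arr.T if unpack else arr
```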
On Mon, Jun 30, 2014 at 3:47 PM, Derek Homeier <derek@astro.physik.uni-goettingen.de> wrote:
It's possible we could steal some code from Pandas for this. IIRC they have C/Cython text parsing routines. (It's also an interesting question whether they've fixed the unicode/binary issues, might be worth checking before rewriting from scratch...) -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
It's also an interesting question whether they've fixed the unicode/binary issues,
Which brings up the "how do we handle text/strings in numpy? issue. We had a good thread going here about what the 'S' data type should be , what with py3 and all, but I don't think we ever really resolved that. IIRC, the key issue was whether we should have a "proper" one-byte-per-character text type -- after all, ASCI/ANSI text is pretty common in scientific data sets, and 4 bytes per char is a fair bit of overhead. Anyway, this all ties in with the text file parsing issues... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Mon, Jun 30, 2014 at 9:31 AM, Nathaniel Smith <njs@pobox.com> wrote:
On 30 Jun 2014 17:05, "Chris Barker" <chris.barker@noaa.gov> wrote:
Anyway, this all ties in with the text file parsing issues...
Only tangentially though :-)
well, a fast text parser (and "text mode") input file will either need to deal with Unicode properly or not. But your point is well taken. We did have a good thread about his a few months back, which resulted in the usual thing of kind of withering away with no decision or action. But I've added it to the list to talk about at SciPy... -Chris
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On 30 Jun 2014, at 04:56 pm, Nathaniel Smith <njs@pobox.com> wrote:
Good point, last time I was playing with Pandas it was not any faster, but now a 10x speedup speaks for itself. Their C engine does not support generic whitespace separators, but that could probably be addressed in a numpy implementation. Derek
In pandas 0.14.0, generic whitespace IS parsed via the c-parser, e.g. specifying '\s+' as a separator. Not sure when you were playing last with pandas, but the c-parser has been in place since late 2012. (version 0.8.0) http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#text-parsing-a...
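For reference, a minimal example of what Jeff describes (assuming pandas >= 0.14.0; 'data.txt' is just a placeholder file name):

```python
import pandas as pd

# '\s+' is handled by the C parser in pandas >= 0.14.0, so reading
# whitespace-delimited numeric tables stays on the fast path.
df = pd.read_csv('data.txt', sep=r'\s+', header=None, engine='c')
arr = df.values  # plain numpy array for downstream numpy code
```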
On 30.06.2014, at 23:10, Jeff Reback <jeffreback@gmail.com> wrote:
In pandas 0.14.0, generic whitespace IS parsed via the c-parser, e.g. specifying '\s+' as a separator. Not sure when you were playing last with pandas, but the c-parser has been in place since late 2012. (version 0.8.0)
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#text-parsing-a...
Ah, I did not see the '\s' syntax in the documentation and thought ' *' would be the only option. Thanks, Derek
participants (5)
- Chris Barker
- Derek Homeier
- Jeff Reback
- Julian Taylor
- Nathaniel Smith