I hope you didn't mean to take this off-list:
On Fri, Jan 17, 2014 at 2:06 PM, Neil Schemenauer <nas@arctrix.com> wrote:
In gmane.comp.python.devel, you wrote:
> For the record, we've got a pretty good thread (not this good, though!)
> over on the numpy list about how to untangle the mess that has resulted
 
Not sure about your definition of good. ;-)

well, in the sense of "big" anyway...
 
Could you summarize the main points on python-dev?  I'm not feeling
up to wading through another massive thread but I'm quite interested
to hear the challenges that numpy deals with.

Well, not much new to it, really. But here's a re-cap:

numpy has had an 'S' dtype for a while, which corresponded to the py2 string type (except for being fixed length). So it could auto-convert to-from python strings... all was good and happy.

Enter py3: what to do? There is no py2 string type anymore. So it was decided to have the 'S' dtype correspond to the py3 bytes type. Apparently there was thought of renaming it, but the 'B' and 'b' type identifiers were already taken, so 'S' was kept.

However, as we all know in this thread, the py3 bytes type is not the same thing as a py2 string (or py2 bytes, natch), and folks like to use the 'S' type for text data -- so that is kind of broken in py3.

However, other folks use the 'S' type for binary data, so like (and rely on) it being mapped to the py3 bytes type. So we are stuck with that.
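
To make that concrete, here is a minimal sketch of what the 'S' dtype hands back on py3. The array contents are made up for illustration; the only calls used are np.array and the ordinary bytes.decode():

```python
import numpy as np

# 'S3' stores fixed-width 3-byte fields; on py3 the elements come back as bytes.
s_arr = np.array([b'abc', b'de'], dtype='S3')
first = s_arr[0]
is_bytes = isinstance(first, bytes)   # the numpy scalar is a bytes subclass on py3

# Anyone who wants text has to decode explicitly (ascii is assumed here).
as_text = first.decode('ascii')

print(is_bytes, repr(first), repr(as_text))
```

That is fine for the binary-data folks, and an extra step for anyone who was storing text in 'S' arrays.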

Given the nature of numpy, and scientific data, there is talk of having a one-byte-per-char text type in numpy. (There is already a unicode type, but it uses 4 bytes per char, since it's key to the numpy data model that all objects of a given type are the same size.) This would be analogous to the current multiple precision options for numbers: it would take up less memory, but would not be able to hold all values. It's not clear what the level of support is for this right now -- after all, you can do everything you need to do with the appropriate calls to encode() and decode(), if a bit awkwardly.
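
The trade-off can be sketched with nothing but encode(). This is not numpy code: pack_fixed is a hypothetical helper, and latin-1 is only my assumption for what a one-byte-per-char encoding might be (the numpy discussion has not settled on one):

```python
def pack_fixed(strings, width, encoding):
    """Encode each string and null-pad it to exactly `width` bytes."""
    fields = []
    for s in strings:
        raw = s.encode(encoding)
        if len(raw) > width:
            raise ValueError("%r does not fit in %d bytes" % (s, width))
        fields.append(raw.ljust(width, b'\x00'))
    return b''.join(fields)

names = ['abc', 'de']
one_byte = pack_fixed(names, 3, 'latin-1')       # 3 chars * 1 byte per item
four_byte = pack_fixed(names, 12, 'utf-32-le')   # 3 chars * 4 bytes per item

# The price of the compact form: it cannot hold every value.
try:
    'snow\u2603'.encode('latin-1')
    fits = True
except UnicodeEncodeError:
    fits = False

print(len(one_byte), len(four_byte), fits)
```

Same data, a quarter of the memory, as long as every character fits in the narrower encoding.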

Meanwhile, back at the ranch -- related, but separate, issues have arisen with the functions that parse text files: numpy.loadtxt and numpy.genfromtxt. These functions were adapted for py3 just enough to get things to mostly work, but they have some serious limitations when doing anything with unicode. In fact, they do some weird things even with plain ascii text files if you ask them to create unicode objects, which is a natural thing to do (and the "right" thing to do in the py3 text model) if you do something like:

arr = loadtxt('a_file_name', dtype=str)

On py3, a str is a py3 unicode string, so you get the numpy 'U' datatype, but loadtxt wasn't designed to deal with that, so you can get stuff like:

["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
 "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
 "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]

This was presumably (I haven't debugged the code) due to the conversion from bytes to unicode... (I'm still confused about the extra slashes.)
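
One plausible reading of that output, and it is only a guess since I haven't traced it through loadtxt: somewhere a bytes object is passed to str() instead of being decoded. On py3 that gives you the repr of the bytes as text, with the b'' wrapper and doubled backslashes, and printing the result inside a list or array doubles them again. The path below is copied from the output above:

```python
# The bytes as they would sit in the file: C:\Users\Documents\...
raw = b'C:\\Users\\Documents\\Project\\mytextfile1.txt'

wrong = str(raw)             # repr of the bytes object, now stored as text
right = raw.decode('ascii')  # what you actually wanted

print([wrong])   # repr of the str escapes each backslash once more
print([right])
```

If that is what is happening, it would account for both the b'...' wrapper and the extra slashes.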

And this is ascii text -- it gets worse if there is non-ascii text in there.

Anyway, the truth is, this stuff is hard, but it will get at least a touch easier with PEP 461.

[Though to be truthful, I'm not sure why someone put a comment in the issue tracker about b'%d' % some_num being an issue ... I'm not sure how that applies when we're going from text to numbers, not the other way around...]
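
For reference, the two directions look like this. The first line assumes an interpreter that implements the %-formatting for bytes proposed in PEP 461; the parsing lines need no such thing. The values are made up:

```python
# Numbers -> bytes: this is the direction PEP 461 is about.
formatted = b'%d' % 42

# Bytes -> numbers: the direction a text-file parser goes; int() and float()
# already accept ascii bytes directly.
parsed_int = int(b'42')
parsed_float = float(b'3.5')

print(formatted, parsed_int, parsed_float)
```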

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov