Sebastian Berg writes:
On Tue, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
Hi! I'm taking a shot at issue #3184 [1], based on the observation that the string dtype ('S') under Python 3 uses byte arrays instead of unicode (the only readable string type in Python 3).
This brings two major problems:
* numpy code has to jump through hoops to open and read files as binary data in order to load text into a bytes array, and does not play well with users providing string (unicode) arguments
* the repr of these arrays shows strings as b'text' instead of 'text', which breaks doctests of software built on numpy
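For readers following along, the repr difference described above can be reproduced with a small sketch. It assumes Python 3 with numpy installed; the array names are made up for illustration and do not come from the issue.

```python
import numpy as np

# 'S' holds bytes under Python 3, so elements are shown with a b prefix.
bytes_arr = np.array([b'text'], dtype='S')
# 'U' holds unicode text, so elements are shown as ordinary str.
str_arr = np.array(['text'], dtype='U')

print(repr(bytes_arr))
print(repr(str_arr))

# An element of an 'S' array has to be decoded before it reads as text.
print(repr(bytes_arr[0].decode('ascii')))
```

A doctest written against the str form would fail on the first array, because the printed element carries the b prefix.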
What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING and NPY_UNICODE).
Now the question. Keeping 'S' and 'U' as separate dtypes (but same internal implementation) will provide the best backwards compatibility, but is more cumbersome to implement.
I am not sure how that can be possible. Those types are fundamentally different in how they store their data. String types use one byte per character, whereas unicode types use 4 bytes per character. You can maybe default to unicode in more cases in Python 3, but you cannot make them identical internally.
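The storage difference can be checked directly. This is only a sketch, assuming Python 3 with numpy installed; the variable names are illustrative, and the '<' in the unicode dtype is chosen here to pin little-endian byte order so that the raw bytes are predictable.

```python
import numpy as np

# Four characters stored as a byte string: one byte per character.
s = np.array([b'text'], dtype='S4')
# The same four characters stored as unicode: four bytes (UCS4) per character.
u = np.array(['text'], dtype='<U4')

print(s.dtype.itemsize)  # 4
print(u.dtype.itemsize)  # 16

# The raw buffers differ too: the unicode buffer is UTF-32 little-endian here.
print(s.tobytes())
print(u.tobytes() == 'text'.encode('utf-32-le'))  # True
```

So an array of N-character strings takes four times the memory as 'U' compared to 'S', which is why the two cannot share one in-memory layout without changing what 'S' means.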
BTW, by identical I mean having two externally visible types, but a common implementation in Python 3 (that of NPY_UNICODE). The as-sane but not backwards-compatible option (I'm asking if this is acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE implementation, and making 'U' (and np.unicode_) an alias for 'S' (and np.string_).

Cheers,
  Lluis