[Numpy-discussion] String & unicode arrays vs text loading in python 3

Tue Sep 13 10:17:51 EDT 2016

Sebastian Berg writes:

> On Tue, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
>> Hi! I'm giving a shot to issue #3184 [1], based on the observation that
>> the string dtype ('S') under python 3 uses byte arrays instead of
>> unicode (the only readable string type in python 3).
>>
>> This brings two major problems:
>>
>> * numpy code has to jump through hoops to open and read files as binary
>>   data to load text into a bytes array, and does not play well with
>>   users providing string (unicode) arguments
>>
>> * the repr of these arrays shows strings as b'text' instead of 'text',
>>   which breaks doctests of software built on numpy
>>
>> What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING
>> and NPY_UNICODE).
>>
>> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
>> internal implementation) will provide the best backwards compatibility,
>> but is more cumbersome to implement.

> I am not sure how that can be possible. Those types are fundamentally
> different in how they store their data. String types use one byte per
> character, unicode types will use 4 bytes per character. You can maybe
> default to unicode in more cases in python 3, but you cannot make them
> identical internally.

> What about giving `np.loadtxt` an encoding kwarg or something along
> that line?

np.loadtxt and np.genfromtxt are already quite complex in handling the implicit
conversion to byte-array imposed by numpy's port to python 3, and still fail in
some corner cases.

This conversion is also inherently surprising to users. What I'd get in
python 2:

  >>> np.array('foo', dtype='S')
  array('foo', dtype='|S3')

gives me something different in python 3 (note the b prefix on the resulting
string):

  >>> np.array('foo', dtype='S')
  array(b'foo', dtype='|S3')

It's not only surprising, but also breaks absolutely all the doctests I have
with arrays that contain strings (it even breaks numpy's examples).

That's why adding an encoding kwarg (although better than the current
auto-magical conversion to binary) won't solve my problems. The 'S' dtype will
still be a binary array, which shows up in the repr.

Since all strings in python 3 are unicode, I'm expecting "string" and "unicode"
arrays in numpy to be the same *and* show up as strings (e.g., 'foo' instead of
b'foo').
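
To make the repr difference concrete, here is a minimal sketch, assuming
python 3 and numpy imported as np (the variable names are only illustrative).
It compares the 'S' and 'U' dtypes as they behave today, not the proposed
change:

```python
import numpy as np

# 'S' stores raw bytes, so python 3 shows every element with a b'' prefix.
bytes_arr = np.array(['foo', 'bar'], dtype='S')

# 'U' stores unicode text, so elements show up as plain strings.
text_arr = np.array(['foo', 'bar'], dtype='U')

print(repr(bytes_arr))  # elements are shown as b'foo', b'bar'
print(repr(text_arr))   # elements are shown as 'foo', 'bar'

# Converting back to python objects makes the type difference explicit.
print(bytes_arr.tolist())  # a list of bytes objects
print(text_arr.tolist())   # a list of str objects
```

Any doctest written against the 'foo' form fails on the first array and
passes on the second.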

Yes, the difference between these types is in how they store their data. What
I'm proposing is to always use unicode in python 3.
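
The storage difference can be checked directly. This is a small sketch under
the same assumptions (python 3, numpy imported as np, illustrative names); it
only inspects the existing dtypes:

```python
import numpy as np

s = np.array(['foo'], dtype='S')  # byte string: one byte per character
u = np.array(['foo'], dtype='U')  # unicode: four bytes per character

# itemsize is the number of bytes used by one array element.
print(s.dtype.itemsize)  # 3
print(u.dtype.itemsize)  # 12
```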

If necessary, we can add a new dtype that lets users store raw byte arrays. By
making them explicitly byte arrays, that shouldn't raise any new surprises.

I already started making the changes I described (as a result of the discussion
in #3184 [1]), but wanted to double-check with the list before getting deeper
into it.

[1] https://github.com/numpy/numpy/issues/3184

Cheers,
  Lluis