[Numpy-discussion] Text array dtype for numpy

Oscar Benjamin oscar.j.benjamin at gmail.com
Sat Jan 25 17:45:13 EST 2014


On 24 January 2014 22:43, Chris Barker <chris.barker at noaa.gov> wrote:
> Oscar,
>
> Cool stuff, thanks!
>
> I'm wondering though what the use-case really is.

The use-case is precisely the use-case for dtype='S' on Py2 except
that it also works on Py3.
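
To see the problem, this is what dtype='S' gives you on Py3:

>>> import numpy as np
>>> a = np.array(['qwe'], dtype='S')
>>> a[0]  # we put text in but get bytes back
b'qwe'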

> The py3 text model
> (actually the py2 one, too) is quite clear that you want users to think of,
> and work with, text as text -- and not care how things are encoded in the
> underlying implementation. You only want the user to think about encodings
> on I/O -- transferring stuff between systems where you can't avoid it. And
> you might choose different encodings based on different needs.

Exactly. But what you're missing is that storing text in a numpy array
means putting the text into bytes, so an encoding has to be specified.
My proposal makes that encoding explicit. This is the key point about
the Python 3 text model: it is not that encoding never happens
automatically (it does, e.g. when you print() or call file.write() on a
text file); the point is that there must never be ambiguity about which
encoding is used when encoding or decoding occurs.
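
To illustrate the ambiguity problem with plain Python (no numpy needed):
decoding with the wrong codec doesn't fail, it silently gives you mojibake:

>>> b = 'Åå'.encode('utf-8')
>>> b.decode('utf-8')   # the right encoding round-trips
'Åå'
>>> b.decode('latin-1') # a wrong guess silently corrupts the text
'Ã\x85Ã¥'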

> So why have a different, the-user-needs-to-think-about-encodings numpy
> dtype? We already have 'U' for full-on unicode support for text. There is a
> good argument for a more compact internal representation for text compatible
> with one-byte-per-char encoding, thus the suggestion for such a dtype. But I
> don't see the need for quite this. Maybe I'm not being a creative enough
> thinker.

Because users want to store text in a numpy array using fewer than 4
bytes per character. You expressed a desire for this yourself. The only
difference between this and your latin-1 suggestion is that here the
encoding is explicit and visible to the user, and you can choose it to
be anything that your Python installation supports.
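
For example, a one-byte-per-char encoding uses a quarter of the memory
that dtype='U' needs for the same text:

>>> import numpy as np
>>> np.dtype('U10').itemsize  # 4 bytes per character
40
>>> len('0123456789'.encode('latin-1'))  # 1 byte per character
10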

> Also, we may want numpy to interact at a low level with other libs that
> might have binary encoded text (HDF, etc) -- in which case we need a bytes
> dtype that can store that data, and perhaps encoding and decoding ufuncs.

Perhaps there is a need for a bytes dtype as well. But note that you
can use textarray with encoding='ascii' to satisfy many of these use
cases. So h5py and pytables can expose an interface that stores text
as bytes but has a clearly labelled (and enforced) encoding.
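
The enforcement comes for free from the codec: inserting non-ascii text
into an 'ascii' textarray would fail at encode time, just as it does here:

>>> 'naïve'.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xef' in position 2: ordinal not in range(128)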

> If we want a more efficient and compact unicode implementation then the py3
> one is a good place to start - it's pretty slick! Though maybe harder to do
> in numpy as text in numpy probably wouldn't be immutable.

It's not a good fit for numpy because numpy arrays expose their memory
buffer. More on this below, but if there were to be something as
drastic as the FSR (CPython's flexible string representation, PEP 393)
then it would be better to think about how to make an ndarray type that
is completely different: one with an opaque memory buffer that can
handle arbitrary-length text strings.

>> To make a slightly more concrete proposal, I've implemented a pure
>> Python ndarray subclass that I believe can consistently handle
>> text/bytes in Python 3.
>
> this scares me right there -- is it text or bytes??? We really don't want
> something that is both.

I believe that there is a conceptual misunderstanding about what a
numpy array is here.

A numpy array is a clever view onto a memory buffer. A numpy array
always has two interfaces, one that describes a memory buffer and one
that delivers Python objects representing the abstract quantities
described by each portion of the memory buffer. The dtype specifies
three things:
1) How many bytes of the buffer are used.
2) What kind of abstract object this part of the buffer represents.
3) The mapping from the bytes in this segment of the buffer to the
abstract object.

As an example:

>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype='<u4')
>>> a
array([1, 2, 3], dtype=uint32)
>>> a.tostring()
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'

So what is this array? Is it bytes or is it integers? It is both. The
array is a view onto a memory buffer and the dtype is the encoding
that describes the meaning of the bytes in different segments. In this
case the dtype is '<u4'. This tells us that we need 4 bytes per
segment, that each segment represents an integer and that the mapping
from byte segments to integers is the unsigned little-endian mapping.
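
You can check that mapping by hand with the struct module (continuing
the session above -- same buffer, same interpretation):

>>> import struct
>>> struct.unpack('<3I', a.tostring())  # little-endian unsigned 4-byte ints
(1, 2, 3)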

How can we do the same thing with text? We need a way to map text to
fixed-width bytes. Mapping text to bytes is done with text encodings.
So we need a dtype that incorporates a text encoding in order to
define the relationship between the bytes in the array's memory buffer
and the abstract entity that is a sequence of Unicode characters.
Using dtype='U' doesn't get around this:

>>> a = np.array(['qwe'], dtype='U')
>>> a
array(['qwe'],
      dtype='<U3')
>>> a[0] # text
'qwe'
>>> a.tostring() # bytes
b'q\x00\x00\x00w\x00\x00\x00e\x00\x00\x00'

In my proposal you'd get the same by using 'utf-32-le' as the encoding
for your text array.
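
You can verify that equivalence directly (continuing the session above);
the 'U' buffer is exactly the utf-32-le encoding of the text:

>>> a.tostring() == 'qwe'.encode('utf-32-le')
True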

>> The idea is that the array has an encoding. It stores strings as
>> bytes. The bytes are encoded/decoded on insertion/access. Methods
>> accessing the binary content of the array will see the encoded bytes.
>> Methods accessing the elements of the array will see unicode strings.
>>
>> I believe it would not be as hard to implement as the proposals for
>> variable length string arrays.
>
> except that with some encodings, the number of bytes required is a function
> of what the content of the text is -- so it either has to be variable
> length, or a fixed number of bytes, which is not a fixed number of
> characters -- which requires both careful truncation (a pain) and surprising
> results for users ("why can't I fit 10 characters in a length-10 text
> object? And I can if they are different characters?")

It should be a fixed number of bytes. That does mean that 10 characters
might not fit into a 10-byte text portion, but there's no way around
that if the length is fixed and the encoding is variable-width. I don't
really think this is much of a problem though. Most use cases will
probably use 'ascii' anyway, and what those use cases gain is error
detection for non-ascii characters and an explicitly labelled encoding,
rather than mojibake.
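
For example, with 'utf-8' these ten characters need twenty bytes, so
they wouldn't fit in a 10-byte portion:

>>> s = 'αβγδεζηθικ'
>>> len(s)                  # 10 characters...
10
>>> len(s.encode('utf-8'))  # ...but 20 bytes in utf-8
20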

>> The one caveat is that it will strip
>> null characters from the end of any string.
>
> which is fatal, but you do want a new dtype after all, which presumably
> wouldn't do that.

Why is that fatal for text (as opposed to arbitrary byte strings)?
There are many other reasons (relating to other programming languages
and software) why you can't usually put null characters into text
anyway.

I don't really see how to get around this if the bytes must go into
fixed-width portions, unless there is an out-of-band way to specify the
length of the string.
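
Note that stripping trailing nulls is exactly what dtype='S' already does:

>>> import numpy as np
>>> a = np.array([b'ab\x00\x00'], dtype='S4')
>>> a[0]          # trailing nulls are stripped on access...
b'ab'
>>> a.tostring()  # ...but are still present in the buffer
b'ab\x00\x00'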


Oscar


