[Numpy-discussion] proposal: smaller representation of string arrays

Stephan Hoyer shoyer at gmail.com
Thu Apr 20 13:26:13 EDT 2017

Julian -- thanks for taking this on. NumPy's handling of strings on Python
3 certainly needs fixing.

On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.faceted at gmail.com> wrote:

> Variable-length encodings, of which UTF-8 is obviously the one that makes
> good handling essential, are indeed more complicated. But is it strictly
> necessary that string arrays hold fixed-length *strings*, or can the
> encoding length be fixed instead? That is, currently if you try to assign a
> longer string than will fit, the string is truncated to the number of
> characters in the data type. Instead, for encoded Unicode, the string could
> be truncated so that the encoding fits. Of course this is not completely
> trivial for variable-length encodings, but it should be doable, and it
> would allow UTF-8 to be used just the way it usually is - as an encoding
> that's almost 8-bit.
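Anne's suggestion can be sketched in a few lines. This is a hypothetical helper (not NumPy code): truncate a string so that its UTF-8 encoding fits in a fixed number of bytes, without splitting a multi-byte codepoint.

```python
def truncate_utf8(s: str, max_bytes: int) -> str:
    """Return the longest prefix of s whose UTF-8 encoding is <= max_bytes."""
    data = s.encode("utf-8")[:max_bytes]
    # The byte slice can end in the middle of a multi-byte codepoint;
    # decoding with errors="ignore" drops the trailing partial sequence,
    # leaving only whole, valid characters.
    return data.decode("utf-8", errors="ignore")

print(truncate_utf8("héllo", 2))  # the 2-byte "é" doesn't fit -> "h"
print(truncate_utf8("héllo", 3))  # -> "hé"
```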

I agree with Anne here. Variable-length encoding would be great to have,
but even fixed-length UTF-8 (fixed in terms of memory usage, not character
count) would solve NumPy's Python 3 string problem. NumPy's memory model
needs a fixed size per array element, but that doesn't mean we need a fixed
size per character. Each element in a UTF-8 array would be a string with a
fixed number of bytes, not a fixed number of characters.
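A quick illustration of the bytes-versus-characters distinction: under a fixed per-element byte budget (the size here is arbitrary, chosen for the example), how many characters fit depends on how many bytes each codepoint needs in UTF-8.

```python
BUDGET = 12  # hypothetical per-element size in bytes

samples = ["hello world!", "naïve résumé", "日本語テスト"]
for s in samples:
    # len(s) counts characters; the encoded length counts bytes.
    nbytes = len(s.encode("utf-8"))
    print(f"{s!r}: {len(s)} chars, {nbytes} bytes, fits={nbytes <= BUDGET}")
```

Twelve ASCII characters fit exactly, while twelve characters with accents need 15 bytes, and six CJK characters already need 18.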

In fact, we already have this sort of distinction between element size and
memory usage: np.string_ uses null padding to store shorter strings in a
larger dtype.
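A minimal pure-Python sketch of that null-padding model (the function names are illustrative, not NumPy's API): shorter strings are padded with NUL bytes up to the fixed element size on the way in, and the padding is stripped on the way out.

```python
ITEMSIZE = 8  # hypothetical fixed element size in bytes

def store(s: str) -> bytes:
    """Encode s and NUL-pad it to the fixed element size."""
    data = s.encode("utf-8")
    if len(data) > ITEMSIZE:
        raise ValueError("encoded string too long for element")
    return data.ljust(ITEMSIZE, b"\x00")

def load(raw: bytes) -> str:
    """Strip trailing NUL padding and decode back to a string."""
    return raw.rstrip(b"\x00").decode("utf-8")

print(load(store("abc")))  # round-trips -> "abc"
```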

The only reason I see for supporting encodings other than UTF-8 is for
memory-mapping arrays stored with those encodings, but that seems like a
lot of extra trouble for little gain.