[Numpy-discussion] proposal: smaller representation of string arrays
shoyer at gmail.com
Thu Apr 20 13:26:13 EDT 2017
Julian -- thanks for taking this on. NumPy's handling of strings on Python
3 certainly needs fixing.
On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.faceted at gmail.com> wrote:
> Variable-length encodings, of which UTF-8 is obviously the one that makes
> good handling essential, are indeed more complicated. But is it strictly
> necessary that string arrays hold fixed-length *strings*, or can the
> encoding length be fixed instead? That is, currently if you try to assign a
> longer string than will fit, the string is truncated to the number of
> characters in the data type. Instead, for encoded Unicode, the string could
> be truncated so that the encoding fits. Of course this is not completely
> trivial for variable-length encodings, but it should be doable, and it
> would allow UTF-8 to be used just the way it usually is - as an encoding
> that's almost 8-bit.
I agree with Anne here. Variable-length encoding would be great to have,
but even fixed length UTF-8 (in terms of memory usage, not characters)
would solve NumPy's Python 3 string problem. NumPy's memory model needs a
fixed size per array element, but that doesn't mean we need a fixed size
per character. Each element in a UTF-8 array would be a string with a fixed
number of bytes, not a fixed number of characters.
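A minimal sketch of Anne's truncation idea (not any actual NumPy implementation): clip a string so its UTF-8 encoding fits a fixed byte budget, backing up so a multi-byte codepoint is never split. The helper name and signature here are hypothetical.

```python
def truncate_utf8(s: str, nbytes: int) -> bytes:
    """Encode s as UTF-8, truncated to at most nbytes bytes.

    Never cuts in the middle of a multi-byte codepoint.
    """
    encoded = s.encode("utf-8")
    if len(encoded) <= nbytes:
        return encoded
    # UTF-8 continuation bytes look like 0b10xxxxxx; back up past any
    # of them so the cut lands on a codepoint boundary.
    end = nbytes
    while end > 0 and (encoded[end] & 0xC0) == 0x80:
        end -= 1
    return encoded[:end]

# 'é' encodes to two bytes, so a 2-byte budget keeps only 'h':
print(truncate_utf8("héllo", 2))  # b'h'
print(truncate_utf8("héllo", 3))  # b'h\xc3\xa9'
```

Assignment into a fixed-itemsize UTF-8 array could use exactly this rule: the element size bounds the encoded length, while the number of characters stored varies per element.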
In fact, we already have this sort of distinction between element size and
memory usage: np.string_ uses null padding to store shorter strings in a
larger fixed-size dtype.
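The existing null-padding behavior can be seen directly: an `'S5'` array reserves five bytes per element regardless of how many bytes each stored string actually uses.

```python
import numpy as np

# Every element occupies 5 bytes; shorter strings are null-padded.
arr = np.array([b"ab", b"abcde"], dtype="S5")
print(arr.itemsize)   # 5
print(arr.tobytes())  # b'ab\x00\x00\x00abcde'
print(arr[0])         # b'ab' -- trailing nulls are stripped on read
```

A fixed-byte-width UTF-8 dtype would reuse this same padding scheme, just with a different interpretation of the stored bytes.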
The only reason I see for supporting encodings other than UTF-8 is for
memory-mapping arrays stored with those encodings, but that seems like a
lot of extra trouble for little gain.