Julian -- thanks for taking this on. NumPy's handling of strings on Python 3 certainly needs fixing.

On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:
Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type. Instead, for encoded Unicode, the string could be truncated so that the encoding fits. Of course this is not completely trivial for variable-length encodings, but it should be doable, and it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit.
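[For concreteness, a minimal sketch of what "truncate so that the encoding fits" could mean in practice: keep as many whole codepoints as fit in the element's byte budget, never splitting a multi-byte sequence. The helper name is hypothetical, not anything proposed in this thread.]

def truncate_to_utf8_bytes(s, max_bytes):
    # Truncate s so that its UTF-8 encoding fits in max_bytes,
    # without splitting a multi-byte codepoint at the end.
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    # errors="ignore" silently drops a trailing partial codepoint.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

truncate_to_utf8_bytes("héllo", 5)   # -> 'héll'  (h=1 byte, é=2, l=1, l=1)
truncate_to_utf8_bytes("héllo", 2)   # -> 'h'     (the second byte of 'é' would not fit)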
I agree with Anne here. Variable-length encoding would be great to have, but even fixed-length UTF-8 (fixed in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed size per character. Each element in a UTF-8 array would be a string stored in a fixed number of bytes, not a fixed number of characters. In fact, we already draw this sort of distinction between string length and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype. The only reason I see for supporting encodings other than UTF-8 is memory-mapping arrays already stored in those encodings, but that seems like a lot of extra trouble for little gain.
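[To make the bytes-versus-characters distinction concrete, a small illustration (assumptions mine): it uses today's fixed-width bytes dtype as a stand-in for a fixed-byte-width UTF-8 dtype. Every element occupies the same number of bytes, null-padded the way np.string_ already works, but decodes to a different number of characters.]

import numpy as np

texts = ["numpy", "héllo", "日本"]
# Store the UTF-8 encoding in a fixed 8-bytes-per-element dtype;
# shorter entries are null-padded, exactly as np.string_ does today.
arr = np.array([t.encode("utf-8") for t in texts], dtype="S8")

print(arr.dtype.itemsize)            # 8 bytes of memory per element, always
for raw in arr:
    s = raw.decode("utf-8")          # trailing null padding is stripped on read
    print(repr(s), len(raw), len(s)) # encoded bytes vs. decoded characters

A native UTF-8 dtype would presumably handle the encode/decode round-trip to Python str itself instead of leaving it to the user; the point here is only that the fixed quantity is bytes, not characters.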