On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed sized per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.

Ah, yes -- the nightmare of Unicode!

No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units"). and an unknown number of characters.

As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points, combining characters make this even more confusing) then you would need to allocate N*4 bytes to make sure you could hold any string that long. Which would be pretty pointless -- better to use UCS-4.

So Anne's suggestion that numpy truncates as needed would make sense -- you'd specify say N characters, numpy would arbitrarily (or user specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct).

But how much to over allocate? for english text, with an occasional scientific symbol, only a little. for, say, Japanese text, you'd need a factor 2 maybe? 

Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if:

you need fixed-length storage
you care about compactness

In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.

sure -- but it is clear to the user that the dtype can hold "up to this many" characters.
The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain.

I see it the other way around -- the only reason TO support utf-8 is for memory mapping with other systems that use it :-)

On the other hand,  if we ARE going to support utf-8 -- maybe use it for all unicode support, rather than messing around with all the multiple encoding options.

I think a 1-byte-per char latin-* encoded string is a good idea though -- scientific use tend to be latin only and space constrained.

All that being said, if the truncation code were carefully written, it would mostly "just work"



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception