On Thu, Apr 20, 2017 at 10:43 AM, Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed sized per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.

Ah, yes -- the nightmare of Unicode!

No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units") and an unknown number of characters.

Apologies for confusing the terminology! Yes, this would mean a fixed number of bytes and an unknown number of characters. 
 
As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points, combining characters make this even more confusing) then you would need to allocate N*4 bytes to make sure you could hold any string that long. Which would be pretty pointless -- better to use UCS-4.
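For illustration (plain Python, nothing numpy-specific): UTF-8 spends anywhere from 1 to 4 bytes per code point, which is where the N*4 worst case comes from.

    for ch in ["a", "é", "あ", "🐍"]:
        print(ch, len(ch.encode("utf-8")))
    # prints: a 1, é 2, あ 3, 🐍 4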

It's already unsafe to try to insert arbitrary length strings into a numpy string_ or unicode_ array. When determining the dtype automatically (e.g., with np.array(list_of_strings)), the difference is that numpy would need to check the maximum encoded length instead of the character length (i.e., len(x.encode()) instead of len(x)).
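A rough sketch of that difference when inferring an itemsize (pure Python; nothing here is an actual numpy API):

    strings = ["numpy", "héllo", "日本語"]

    # character count, as numpy does today for unicode_ (UCS-4):
    max(len(s) for s in strings)                    # 5

    # encoded byte count, as a utf-8 dtype would have to use:
    max(len(s.encode("utf-8")) for s in strings)    # 9 ("日本語" is 3 chars, 9 bytes)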

I certainly would not over-allocate. If users want more space, they can explicitly choose an appropriate size. (This is a hazard of not having variable-length dtypes.)

If users really want to be able to fit an arbitrary number of unicode characters and aren't concerned about memory usage, they can still use np.unicode_ -- that won't be going away.
 
So Anne's suggestion that numpy truncates as needed would make sense -- you'd specify say N characters, numpy would arbitrarily (or user specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct).
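(A sketch of truncating at a valid boundary, in pure Python -- not the actual implementation, just to show it's a few lines of code:)

    def truncate_utf8(s, max_bytes):
        # Cut the encoded string, then drop any incomplete trailing
        # multi-byte sequence so the result is still valid UTF-8.
        cut = s.encode("utf-8")[:max_bytes]
        return cut.decode("utf-8", errors="ignore").encode("utf-8")

    truncate_utf8("caféteria", 4)   # b'caf' -- the 2-byte 'é' would be split, so it is dropped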

NumPy already does this sort of silent truncation with longer strings inserted into shorter string dtypes. The difference here would indeed be the need to check the number of bytes represented by the string instead of the number of characters.
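That existing behavior, for reference:

    import numpy as np

    a = np.zeros(1, dtype="U5")
    a[0] = "hello world"      # silently truncated to 5 characters
    a[0]                      # 'hello'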

But I don't think this is useful behavior to bring over to a new dtype. We should error instead of silently truncating. This is certainly easier than trying to figure out when we would be splitting a character.
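i.e. something along these lines at assignment time (a pure-Python sketch with made-up names, just to show the check is a plain byte-length comparison):

    def check_fits(value, itemsize):
        encoded = value.encode("utf-8")
        if len(encoded) > itemsize:
            raise ValueError(
                "string needs %d UTF-8 bytes but the element holds %d"
                % (len(encoded), itemsize)
            )
        return encoded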
 
But how much to over-allocate? For English text, with an occasional scientific symbol, only a little. For, say, Japanese text, you'd need a factor of 2, maybe?
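For a rough sense of the spread (sample strings picked arbitrarily):

    samples = {
        "english": "the quick brown fox",
        "accented latin": "déjà vu, café, naïve",
        "japanese": "吾輩は猫である",
    }
    for name, text in samples.items():
        print(name, len(text), "chars ->", len(text.encode("utf-8")), "bytes")
    # english        19 chars -> 19 bytes
    # accented latin 20 chars -> 24 bytes
    # japanese        7 chars -> 21 bytes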

Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if:

you need fixed-length storage
you care about compactness

In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.
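That padding is easy to see directly:

    import numpy as np

    a = np.array([b"ab"], dtype="S5")
    a.itemsize      # 5 -- every element occupies 5 bytes
    a.tobytes()     # b'ab\x00\x00\x00' -- shorter strings are null padded
    a[0]            # b'ab' -- trailing nulls are stripped on access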

sure -- but it is clear to the user that the dtype can hold "up to this many" characters.

As Yu Feng points out in this GitHub comment, non-latin language speakers are already aware of the difference between string length and byte length:
https://github.com/numpy/numpy/pull/8942#issuecomment-294409192

Making an API based on code units instead of code points really seems like the saner way to handle unicode strings. I agree with this section of the DyND design docs for its string type, which notes precedent from Julia and Go:
https://github.com/libdynd/libdynd/blob/master/devdocs/string-design.md#code-unit-api-not-code-point

I think a 1-byte-per-char latin-* encoded string is a good idea though -- scientific use tends to be latin only and space constrained.

I think scientific users tend to be ASCII only, so UTF-8 would also work transparently :).
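(ASCII is a byte-for-byte subset of both latin-1 and UTF-8, so for pure-ASCII data the encodings are indistinguishable:)

    s = "temperature_degC"
    s.encode("ascii") == s.encode("utf-8") == s.encode("latin-1")   # True
    len(s.encode("utf-8"))                                          # 16 -- one byte per character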