> if you truncate a utf-8 bytestring, you may get invalid data

Note that in general truncating Unicode code points is not a safe operation either, as combining characters are a thing. So I don't think this is a good argument against UTF-8.

Also, is silent truncation a thing that we want to allow to happen anyway? That sounds like something the user ought to be alerted to with an exception.
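
For illustration, a minimal sketch of both failure modes and one way to truncate safely (the helper name truncate_utf8 is hypothetical, not an actual proposal):

    def truncate_utf8(data, limit):
        # Hypothetical helper: cut to at most `limit` bytes, then let
        # errors="ignore" drop the incomplete trailing multi-byte sequence.
        return data[:limit].decode("utf-8", errors="ignore").encode("utf-8")

    s = "caf\u00e9".encode("utf-8")       # b'caf\xc3\xa9', 5 bytes
    try:
        s[:4].decode("utf-8")             # naive byte slice splits the 2-byte e-acute
    except UnicodeDecodeError:
        pass                              # invalid UTF-8: exactly the problem above
    truncate_utf8(s, 4)                   # b'caf' -- valid, one character shorter

    # Code point truncation is not safe either: a combining character
    # can be split from its base character.
    t = "cafe\u0301"                      # 'café' built with a combining accent
    t[:4]                                 # 'cafe' -- the accent is silently lost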

> if you wanted to specify that a numpy element would be able to hold, say, N characters
> ...
> It simply is not the right way to handle text if [...] you need fixed-length storage

It seems to me that counting code points is pretty futile in Unicode, due to combining characters. The only two meaningful things to count are:
* Graphemes, as that's what the user sees visually. These can span multiple code points.
* Bytes of encoded data, as that's the space needed to store them.

So I would argue that the approach of fixed-codepoint-length storage is itself a flawed design, and so should not be used as a constraint on numpy.

Counting graphemes is hard, so that leaves a byte count as the only sensible option.
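
A short illustration of how the three counts diverge (grapheme counting isn't in the stdlib, so it's only noted in a comment):

    s = "cafe\u0301"              # 'café' with a combining acute accent
    len(s)                        # 5 code points
    len(s.encode("utf-8"))        # 6 bytes of encoded data
    # Graphemes: 4, which is what the user sees. Counting them needs a
    # Unicode segmentation library (e.g. the third-party `grapheme`
    # package); Python itself has no built-in for it.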

I don't foresee variable-length encodings being a problem implementation-wise - they only become a problem if numpy were to acquire a vectorized substring function that is intended to return a view.

I think I'd be in favor of supporting all encodings, and falling back on Python to handle encoding/decoding them.
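
numpy already does something like this for the existing bytes dtype via np.char.encode/np.char.decode, which delegate to Python's codec machinery - a sketch of that fallback:

    import numpy as np

    # Fixed-width bytes elements holding latin-1 encoded data.
    raw = np.array([b"na\xefve", b"r\xe9sum\xe9"], dtype="S8")

    # Decoding/encoding is delegated to Python's codecs, so any
    # encoding Python knows about works here.
    text = np.char.decode(raw, encoding="latin-1")   # dtype('<U6')
    back = np.char.encode(text, encoding="latin-1")  # dtype('S6')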


On Thu, 20 Apr 2017 at 18:44 Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Apr 20, 2017 at 10:26 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
I agree with Anne here. Variable-length encoding would be great to have, but even fixed length UTF-8 (in terms of memory usage, not characters) would solve NumPy's Python 3 string problem. NumPy's memory model needs a fixed size per array element, but that doesn't mean we need a fixed sized per character. Each element in a UTF-8 array would be a string with a fixed number of codepoints, not characters.

Ah, yes -- the nightmare of Unicode!

No, it would not be a fixed number of codepoints -- it would be a fixed number of bytes (or "code units"), and an unknown number of characters.

As Julian pointed out, if you wanted to specify that a numpy element would be able to hold, say, N characters (actually code points, combining characters make this even more confusing) then you would need to allocate N*4 bytes to make sure you could hold any string that long. Which would be pretty pointless -- better to use UCS-4.
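
For reference, numpy's existing unicode dtype already works exactly this way:

    import numpy as np

    a = np.array("abc")   # dtype('<U3')
    a.itemsize            # 12 -- UCS-4 spends 4 bytes per code point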

So Anne's suggestion that numpy truncates as needed would make sense -- you'd specify, say, N characters, numpy would arbitrarily (or user-specified) over-allocate, maybe N*1.5 bytes, and you'd truncate if someone passed in a string that didn't fit. Then you'd need to make sure you truncated correctly, so as not to create an invalid string (that's just code, it could be made correct).

But how much to over-allocate? For English text, with an occasional scientific symbol, only a little. For, say, Japanese text, you'd need a factor of 2, maybe?
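
The factor is easy to measure for a given corpus (arbitrary sample strings, just to illustrate):

    samples = {
        "english":  "the quick brown fox",
        "accented": "naïve résumé",
        "japanese": "こんにちは世界",
    }
    for name, s in samples.items():
        print(name, len(s.encode("utf-8")) / len(s))
    # english   1.0  -- ASCII is 1 byte per code point
    # accented  1.25
    # japanese  3.0  -- CJK code points are 3 bytes each in UTF-8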

Anyway, the idea that "just use utf-8" solves your problems is really dangerous. It simply is not the right way to handle text if:

* you need fixed-length storage
* you care about compactness

In fact, we already have this sort of distinction between element size and memory usage: np.string_ uses null padding to store shorter strings in a larger dtype.

sure -- but it is clear to the user that the dtype can hold "up to this many" characters.
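
For reference, the current bytes dtype behaviour (this is numpy today, not a sketch):

    import numpy as np

    a = np.array([b"ab", b"abcdef"], dtype="S4")
    a[0]         # b'ab'   -- null-padded out to 4 bytes in storage
    a[1]         # b'abcd' -- silently truncated to fit
    a.itemsize   # 4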
 
The only reason I see for supporting encodings other than UTF-8 is for memory-mapping arrays stored with those encodings, but that seems like a lot of extra trouble for little gain.

I see it the other way around -- the only reason TO support utf-8 is for memory mapping with other systems that use it :-)

On the other hand, if we ARE going to support utf-8 -- maybe use it for all unicode support, rather than messing around with all the multiple encoding options.

I think a 1-byte-per-char latin-* encoded string is a good idea though -- scientific uses tend to be Latin-only and space-constrained.
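
Latin-1 in particular has the nice property that one code point is always exactly one byte, so element size and character count coincide:

    "Ångström".encode("latin-1")   # b'\xc5ngstr\xf6m' -- 8 bytes, 8 characters
    "Ångström".encode("utf-8")     # 10 bytes for the same 8 characters
    # latin-1 maps U+0000..U+00FF straight to bytes 0x00..0xFF, so an
    # N-character element is always exactly N bytes.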

All that being said, if the truncation code were carefully written, it would mostly "just work"

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov