On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker <chris.barker@noaa.gov> wrote:
>
> On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
>
>>> In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).
>>
>>
>> Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size.
>
> Exactly -- the character-orientation of python strings means that people are used to thinking that strings have a length that is the number of characters in the string. I think there will be cognitive dissonance if someone does:
>
> arr[i] = a_string
>
> Which then raises a ValueError, something like:
>
> String too long for a string[12] dtype array.

We have the freedom to make the error message not suck. :-)

> When len(a_string) <= 12
>
> AND that will only occur if there are non-ascii characters in the string, and maybe only if there are more than N non-ascii characters. i.e. it is very likely to be a run-time error that may not have shown up in tests.
>
> So folks need to do something like:
>
> len(a_string.encode('utf-8')) to see if their string will fit. If not, they need to truncate it, and how to do THAT is non-obvious, too -- you don't want to truncate the encoded bytes naively, you could end up with an invalid bytestring. But you don't know how many characters to truncate, either.

If this becomes the right strategy for dealing with these problems (and I'm not sure that it is), we can easily make a utility function that does this for people.
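
A minimal sketch of what such a utility could look like, in plain Python. The name truncate_utf8 is hypothetical, not an existing NumPy function. It cuts the encoded bytes at the limit and then drops any partial trailing character, so the result always decodes cleanly:

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of `text` whose UTF-8 encoding fits in
    `max_bytes` bytes. Hypothetical helper, not part of NumPy."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # A hard byte cut can only damage the final character, and
    # errors="ignore" drops that incomplete trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


word = "na\u00efve"  # 5 characters, 6 bytes in UTF-8
print(len(word), len(word.encode("utf-8")))
print(truncate_utf8(word, 3))  # the cut lands inside the 2-byte character
```

This works at the code point level only. It can still separate a base character from a following combining mark, so it is not a grapheme-aware truncation.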

This discussion is why I want to be sure that we have our use cases actually mapped out. For this kind of in-memory manipulation, I'd use an object array (a la pandas), then convert to the uniform-width string dtype when I needed to push this out to a C API, HDF5 file, or whatever actually requires a string-dtype array. The required width gets computed from the data after all of the manipulations are done. Doing in-memory assignments to a fixed-encoding, fixed-width string dtype will always have this kind of problem. You should only put up with it if you have a requirement to write to a format that specifies the width and the encoding. That specified encoding is frequently not latin-1!
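
A rough sketch of that workflow, assuming UTF-8 is the encoding the target format requires and using NumPy's existing fixed-width bytes dtype ("S") as a stand-in for a future string dtype. The helper name to_fixed_width_utf8 is made up for illustration:

```python
import numpy as np


def to_fixed_width_utf8(strings):
    """Encode the strings as UTF-8 and pack them into a fixed-width bytes
    array whose width is computed from the data. Hypothetical helper."""
    encoded = [s.encode("utf-8") for s in strings]
    # Width is the longest *encoded* length, with a minimum of 1 byte.
    width = max([len(b) for b in encoded] + [1])
    return np.array(encoded, dtype=f"S{width}")


# Do the in-memory manipulation on an object array of Python strings...
names = np.array(["temp", "\u00b0C", "\u0394x"], dtype=object)
names[0] = "temperature"  # no width limit applies here

# ...and convert only at the point where a fixed-width array is required.
fixed = to_fixed_width_utf8(names)
print(fixed.dtype)
```

Because the width is derived after all of the assignments are done, nothing gets truncated and no assignment can fail for being too long.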

>> I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>
> utf-8 is NOT a one-byte per char encoding. IF you want to assure that your data are one-byte per char, then you could use ASCII, and it would be binary compatible with utf-8, but not sure what the point of that is in this context.
>
> latin-1 or latin-9 buys you (over ASCII):
>
> - A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better.
>
> - A handful of other characters, including scientifically useful ones. (a few greek characters, the degree symbol, etc...)
>
> - round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get a UnicodeDecodeError.

But what if the format I'm working with specifies another encoding? Am I supposed to encode all of my Unicode strings in the specified encoding, then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a really important use case for me.
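
To make that question concrete, here is a small pure-Python illustration of the encode-then-decode-as-latin-1 dance, using UTF-8 as the assumed target encoding. The bytes do survive the round trip, but the intermediate string is not the text you started with, and its length is the byte count rather than the character count:

```python
text = "25 \u00b0C, \u0394T"  # 9 characters, two of them non-ASCII

utf8_bytes = text.encode("utf-8")

# Decoding arbitrary bytes as latin-1 never fails, so the UTF-8 bytes can
# be carried around inside a latin-1 string...
smuggled = utf8_bytes.decode("latin-1")

# ...and re-encoding as latin-1 returns exactly the original bytes.
assert smuggled.encode("latin-1") == utf8_bytes

# But the intermediate string is mojibake, not the original text.
print(len(text), len(utf8_bytes), len(smuggled))
print(smuggled == text)
```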

--
Robert Kern