On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker <chris.barker@noaa.gov> wrote:
latin-1 or latin-9 buys you (over ASCII):
...
- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get a UnicodeDecodeError.
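A two-line check of that round-trip property:

    import os

    blob = os.urandom(32)  # arbitrary binary data
    # latin-1 maps every byte value 0-255 to a code point, so this never raises
    assert blob.decode('latin-1').encode('latin-1') == blob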
For a new application, it's a good thing if a text type breaks when you try to stuff arbitrary bytes into it (see Python 2 vs. Python 3 strings). Certainly, I would argue that nobody should write data in latin-1 unless they're doing so for the sake of a legacy application.

I do understand the value in having some "string" data type that could be used by default by loaders for legacy file formats/applications (e.g., netCDF3) that support unspecified "one byte strings." Then you're a few short calls away from viewing the data in the proper encoding (e.g., array.view('text[my_real_encoding]'), if we support arbitrary encodings) or decoding it (e.g., np.char.decode(array.view(bytes), 'my_real_encoding')). It's not realistic to expect users to know the true encoding for strings from a file before they even look at the data.

On the other hand, if this is the use case, perhaps we really want an encoding closer to the "Python 2" string, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it would let you store arbitrary bytes.
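As a sketch of that decode path with today's tools (the 'text[...]' dtype is hypothetical, but np.char.decode already works on the existing 'S' dtype):

    import numpy as np

    # Bytes loaded from a legacy file; the true encoding turns out to be latin-1
    raw = np.array([b'caf\xe9', b'na\xefve'], dtype='S8')

    # Once the encoding is known, decode to a proper unicode ('U') array
    decoded = np.char.decode(raw, 'latin-1')
    print(decoded)  # ['café' 'naïve']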
Then use a native flexible-encoding dtype for everything else.
No opposition here from me. Though again, I think utf-8 alone would also be enough.
Maybe so -- the major reason for supporting others is binary data exchange with other libraries -- but maybe most of them have gone to utf-8 anyway.
Indeed, it would be helpful for this discussion to know what other encodings are actually in use by scientific applications today. So far, we have real use cases for at least UTF-8, UTF-32, ASCII, and "unknown".

The current 'S' dtype truncates silently already:
One advantage of a new (non-default) dtype is that we can change this behavior.
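For example, assigning a value longer than the itemsize just drops the tail, with no warning:

    import numpy as np

    arr = np.zeros(2, dtype='S4')
    arr[0] = b'hello world'
    print(arr[0])  # b'hell' -- silently truncated, no error raised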
Also -- if utf-8 is the default -- what do you get when you create an array from a Python string sequence? Currently with the 'S' and 'U' dtypes, the dtype is set to the longest string passed in. Are we going to pad it a bit? Stick with the exact number of bytes?
It might be better to avoid this for now, and force users to be explicit about encoding if they use the dtype for encoded text. We can keep bytes/str mapped to the current choices.
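For reference, the current inference behavior that bytes/str would stay mapped to:

    import numpy as np

    # The itemsize is currently inferred from the longest element passed in
    print(np.array(['a', 'abc']).dtype)      # <U3
    print(np.array([b'a', b'abcde']).dtype)  # |S5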