On Mon, Apr 24, 2017 at 10:04 AM, Chris Barker <chris.barker@noaa.gov> wrote:
latin-1 or latin-9 buys you (over ASCII):
...
- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. You may get garbage, but you won't get a UnicodeDecodeError.
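A two-line check of that round-trip property:

    import os

    blob = os.urandom(32)  # arbitrary binary data
    # latin-1 maps every byte value 0-255 to a code point, so this never raises
    assert blob.decode('latin-1').encode('latin-1') == blob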
For a new application, it's a good thing if a text type breaks when you try to stuff arbitrary bytes into it (see Python 2 vs. Python 3 strings). Certainly, I would argue that nobody should write data in latin-1 unless they're doing so for the sake of a legacy application.

I do understand the value in having some "string" data type that could be used by default by loaders for legacy file formats/applications (e.g., netCDF3) that support unspecified "one byte strings." Then you're a few short calls away from viewing the data in the proper encoding (e.g., array.view('text[my_real_encoding]'), if we support arbitrary encodings) or decoding it (e.g., np.char.decode(array.view(bytes), 'my_real_encoding')). It's not realistic to expect users to know the true encoding for strings from a file before they even look at the data.

On the other hand, if this is the use case, perhaps we really want an encoding closer to the "Python 2" string, i.e., "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it would let you store arbitrary bytes.
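As a sketch of that decode path with today's tools (the 'text[...]' dtype is hypothetical, but np.char.decode already works on the existing 'S' dtype):

    import numpy as np

    # Bytes loaded from a legacy file; the true encoding turns out to be latin-1
    raw = np.array([b'caf\xe9', b'na\xefve'], dtype='S8')

    # Once the encoding is known, decode to a proper unicode ('U') array
    decoded = np.char.decode(raw, 'latin-1')
    print(decoded)  # ['café' 'naïve']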
Then use a native flexible-encoding dtype for everything else.
No opposition here from me. Though again, I think utf-8 alone would also be enough.
Maybe so -- the major reason for supporting others is binary data exchange with other libraries -- but maybe most of them have gone to utf-8 anyway.
Indeed, it would be helpful for this discussion to know what other encodings are actually in use by scientific applications today. So far, we have real use cases for at least UTF-8, UTF-32, ASCII, and "unknown".

The current 'S' dtype truncates silently already:
One advantage of a new (non-default) dtype is that we can change this behavior.
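For example, assigning a value longer than the itemsize just drops the tail, with no warning:

    import numpy as np

    arr = np.zeros(2, dtype='S4')
    arr[0] = b'hello world'
    print(arr[0])  # b'hell' -- silently truncated, no error raised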
Also -- if utf-8 is the default -- what do you get when you create an array from a Python string sequence? Currently with the 'S' and 'U' dtypes, the dtype is set to the longest string passed in. Are we going to pad it a bit? Stick with the exact number of bytes?
It might be better to avoid this for now, and force users to be explicit about encoding if they use the dtype for encoded text. We can keep bytes/str mapped to the current choices.
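For reference, the current inference behavior that bytes/str would stay mapped to:

    import numpy as np

    # The itemsize is currently inferred from the longest element passed in
    print(np.array(['a', 'abc']).dtype)      # <U3
    print(np.array([b'a', b'abcde']).dtype)  # |S5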