On Thu, Apr 20, 2017 at 3:17 PM Julian Taylor <jtaylor.debian@googlemail.com> wrote:

To please everyone I think we need to go with a dtype that supports
multiple encodings via metadata, similar to how datetime supports
multiple units.
E.g.: 'U10[latin1]' are 10 characters in latin1 encoding

Encodings we should support are:
- latin1 (1 byte):
it is compatible with ascii and adds extra characters used in the
western world.
- utf-32 (4 bytes):
can represent every character, equivalent to np.unicode

Encodings we should maybe support:
- utf-16 with explicitly disallowing surrogate pairs (2 bytes):
this covers a very large range of possible characters in a reasonably
compact representation
- utf-8 (4 bytes):
variable length encoding with minimum size of 1 byte, but we would need
to assume the worst case of 4 bytes so it would not save anything
compared to utf-32 but may allow third parties to replace an encoding step
with trailing null trimming on serialization.

I should say first that I've never used even non-Unicode string arrays, but is there any reason not to support all the Unicode encodings that Python does, with the same names and semantics? This would surely be the simplest to understand.
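
To make the per-character sizes concrete, here is a rough sketch using Python's own codec names. The sample characters and the helper `encoded_size` are made up for illustration; nothing here is a proposed NumPy API.

```python
# Sample characters chosen to need progressively more bytes (illustrative only).
samples = {
    "a": "a",                # plain ASCII letter
    "e-acute": "\u00e9",     # in latin1, not in ASCII
    "euro": "\u20ac",        # outside latin1, inside the Basic Multilingual Plane
    "g-clef": "\U0001d11e",  # outside the Basic Multilingual Plane
}
codecs = ["ascii", "latin1", "utf-8", "utf-16-le", "utf-32-le"]


def encoded_size(ch, codec):
    """Return the encoded length in bytes, or None if the codec cannot represent ch."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


sizes = {
    name: {codec: encoded_size(ch, codec) for codec in codecs}
    for name, ch in samples.items()
}
for name, row in sizes.items():
    print(name, row)
```

The g-clef is the interesting row: it takes 4 bytes in UTF-16 because it needs a surrogate pair, which is exactly the case a "UTF-16 without surrogate pairs" dtype would have to reject.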

Also, if latin1 is going to be the only practical 8-bit encoding, maybe check with some non-Western users to make sure it's not going to wreck their lives? I'd have selected ASCII as an encoding to treat specially, if any, because Unicode already does that and the consequences are familiar. (I'm used to writing and reading French without accents because it's passed through ASCII, for example.)
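
A quick way to see who latin1 leaves out, as a sketch only: the sample words and the helpers `can_encode` and `strip_to_ascii` are my own illustrations, and the accent stripping is just one way to mimic text that has been squeezed through ASCII.

```python
import unicodedata

# Illustrative sample words in a few scripts.
words = {
    "French": "français",
    "Polish": "zażółć",
    "Greek": "Ελληνικά",
    "Russian": "русский",
}


def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def strip_to_ascii(text):
    """Decompose accented letters, then drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore")


latin1_ok = {lang: can_encode(word, "latin1") for lang, word in words.items()}
print(latin1_ok)
print(strip_to_ascii(words["French"]))
```

Of these samples only the French word survives latin1; the Polish, Greek and Russian ones cannot be stored at all, whereas ASCII at least fails in a way everybody already expects.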

Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type. Instead, for encoded Unicode, the string could be truncated so that the encoding fits. Of course this is not completely trivial for variable-length encodings, but it should be doable, and it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit.
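
Here is a minimal sketch of the truncation I have in mind, assuming UTF-8 and a fixed byte width per array element. The function names `fit_utf8` and `read_utf8` are hypothetical, and the null padding is only my assumption about how a fixed-width field would be filled.

```python
def fit_utf8(text, nbytes):
    """Encode text as UTF-8, truncated to at most nbytes without splitting
    a character, then padded with null bytes to exactly nbytes."""
    data = text.encode("utf-8")
    if len(data) > nbytes:
        end = nbytes
        # If the first dropped byte is a continuation byte (0b10xxxxxx),
        # the cut falls inside a character, so back up to its start.
        while end > 0 and (data[end] & 0xC0) == 0x80:
            end -= 1
        data = data[:end]
    return data.ljust(nbytes, b"\x00")


def read_utf8(field):
    """Trim the trailing null padding and decode the field."""
    return field.rstrip(b"\x00").decode("utf-8")


field = fit_utf8("naïve café", 11)
print(field)
print(read_utf8(field))
```

With an 11-byte field the final "é" would be cut in half, so it is dropped whole and the field is padded instead. Note that this only protects code points: a base letter followed by a combining accent could still lose the accent, which is part of why I say it's not completely trivial.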

All this said, it seems to me that the important use cases for string arrays involve interaction with existing binary formats, so people who have to deal with such data should have the final say. (My own closest approach to this is the FITS format, which is restricted by the standard to ASCII.)

Anne