On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern <robert.kern@gmail.com> wrote:

Let me make a counter-proposal for your latin-1 dtype (your #2) that might address your, Thomas's, and Julian's use cases:

2) We want a single-byte-per-character, NULL-terminated string dtype that can be used to represent mostly-ASCII textish data that may have some high-bit characters from some 8-bit encoding. It should be able to read arbitrary bytes (that is, up to the NULL-termination) and write them back out as the same bytes if unmodified. This lets us read this text from files where the encoding is unspecified (or is lying about the encoding) into `unicode/str` objects. The encoding is specified as `ascii` but the decoding/encoding is done with the `surrogateescape` option so that high-bit characters are faithfully represented in the `unicode/str` string but are not erroneously reinterpreted as other characters from an arbitrary encoding.
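For concreteness, here is a minimal sketch of the round trip described above, using only Python 3's built-in codec machinery rather than any NumPy dtype. The sample bytes and variable names are made up for illustration.

```python
# Bytes as they might come from a file with an unknown 8-bit encoding.
# The 0xE9 and 0xEF bytes are not valid ASCII.
raw = b"caf\xe9 na\xefve"

# Decode as ASCII, but escape each high-bit byte 0xNN to the lone
# surrogate U+DCNN instead of raising or guessing an encoding.
text = raw.decode("ascii", errors="surrogateescape")

# The ASCII part is ordinary text; the high-bit byte became U+DCE9.
escaped_char = text[3]

# Encoding with the same error handler restores the original bytes.
roundtrip = text.encode("ascii", errors="surrogateescape")
```

The string stays one code point per input byte, so a fixed-width single-byte storage would map onto it directly.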

I'd even be happy if Julian or someone wants to go ahead and implement this right now and leave the UTF-8 dtype for a later time.

As long as this ASCII-surrogateescape dtype is not called np.realstring (it's *really* important to me that the bikeshed not be this color). ;-)

This sounds quite similar to my text[unknown] proposal, with the advantage that the concept of "surrogateescape" already exists. Surrogate-escape characters compare equal to themselves, which is maybe less than ideal, but it looks like you can put them in real unicode strings, which is nice.
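A rough sketch of both points in plain Python 3, assuming nothing about how the dtype itself would be implemented. The helper `read_unknown` is hypothetical, as are the guesses about which encoding each byte came from.

```python
def read_unknown(raw: bytes) -> str:
    """Hypothetical helper: decode bytes of unknown encoding losslessly."""
    return raw.decode("ascii", errors="surrogateescape")

# The same byte from two files that may have used different encodings
# (0xE9 is 'é' in Latin-1 but a different character in CP437).
a = read_unknown(b"\xe9")
b = read_unknown(b"\xe9")

# The escaped characters compare equal, although the intended
# characters may have differed.
same = (a == b)

# Escaped characters can be mixed into an ordinary unicode string.
mixed = "na\u00efve " + a

# Writing back out needs the same error handler; a strict encode
# refuses lone surrogates.
back = mixed.encode("utf-8", errors="surrogateescape")
try:
    mixed.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The strict-encode failure means such strings have to be encoded with `surrogateescape` (or cleaned up) before they are handed to code that expects well-formed unicode.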