[Numpy-discussion] proposal: smaller representation of string arrays

Mon Apr 24 19:19:16 EDT 2017

On Mon, Apr 24, 2017 at 4:08 PM, Robert Kern <robert.kern at gmail.com> wrote:

> Let me make a counter-proposal for your latin-1 dtype (your #2) that might
> address your, Thomas's, and Julian's use cases:
>
> 2) We want a single-byte-per-character, NULL-terminated string dtype that
> can be used to represent mostly-ASCII textish data that may have some
> high-bit characters from some 8-bit encoding. It should be able to read
> arbitrary bytes (that is, up to the NULL-termination) and write them back
> out as the same bytes if unmodified. This lets us read this text from files
> where the encoding is unspecified (or is lying about the encoding) into
> `unicode/str` objects. The encoding is specified as `ascii` but the
> decoding/encoding is done with the `surrogateescape` option so that
> high-bit characters are faithfully represented in the `unicode/str` string
> but are not erroneously reinterpreted as other characters from an arbitrary
> encoding.
>
> I'd even be happy if Julian or someone wants to go ahead and implement
> this right now and leave the UTF-8 dtype for a later time.
>
> As long as this ASCII-surrogateescape dtype is not called np.realstring
> (it's *really* important to me that the bikeshed not be this color). ;-)
>

This sounds quite similar to my text[unknown] proposal, with the advantage
that the concept of "surrogateescape" already exists. Surrogate-escape
characters compare equal to themselves, which is maybe less than ideal, but
it looks like you can put them in real unicode strings, which is nice.
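
For concreteness, here is a minimal sketch of the round trip at the Python level. It is not a dtype implementation: `decode_field` and `encode_field` are hypothetical helper names, and the fixed field width and NULL padding are assumptions about how such a dtype might store its data. The only real API used is the standard `surrogateescape` error handler on `bytes.decode` / `str.encode`.

```python
def decode_field(raw: bytes) -> str:
    """Read bytes up to the first NULL and decode them as ASCII.

    High-bit bytes are mapped to lone surrogates (U+DC80..U+DCFF)
    instead of raising or being guessed at.
    """
    payload = raw.split(b"\x00", 1)[0]
    return payload.decode("ascii", errors="surrogateescape")


def encode_field(text: str, width: int) -> bytes:
    """Encode back to bytes and NULL-pad to a fixed width (assumed layout)."""
    payload = text.encode("ascii", errors="surrogateescape")
    if len(payload) > width:
        raise ValueError("string does not fit in the field")
    return payload.ljust(width, b"\x00")


# A field holding "caf" plus one high-bit byte from some unknown 8-bit encoding.
field = b"caf\xe9\x00\x00\x00\x00"
text = decode_field(field)

# Unmodified data is written back out as the same bytes.
round_tripped = encode_field(text, len(field))

# The escaped byte is a lone surrogate: it equals itself, but not a real "é".
same_escape = text == decode_field(field)
equals_real_char = text == "caf\u00e9"

# The escaped string can be mixed with real unicode text.
mixed = text + " \u2603"
```

Here `round_tripped == field` holds, `same_escape` is true and `equals_real_char` is false. Encoding `mixed` with strict UTF-8 raises `UnicodeEncodeError` because of the lone surrogate, so anything writing such strings out has to opt in to `surrogateescape` as well.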