Thanks so much for reviving this conversation -- we really do need to address this.

My thoughts:

What people apparently want is a string type for Python3 which uses less
memory for the common science use case which rarely needs more than
latin1 encoding.

Yes -- I think there is a real demand for that.





To please everyone I think we need to go with a dtype that supports
multiple encodings via metadata, similar to how datetime supports
multiple units.
E.g.: 'U10[latin1]' are 10 characters in latin1 encoding

I wonder if we really need that -- as you say, there is real demand for a compact string type, and for many use cases 1 byte per character is enough. So to keep things really simple, I think a single 1-byte-per-char encoding would meet most people's needs.

What should that encoding be?

latin-1 is obvious (and has the very nice property of being able to round-trip arbitrary bytes -- at least with Python's implementation) and scientific data sets tend to use the latin alphabet (with its ascii roots and all).
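To make the round-trip property concrete, here is a quick sketch in plain Python 3 (no numpy involved; the variable names are just for illustration):

```python
# Every possible byte value, 0x00 through 0xFF.
raw = bytes(range(256))

# Python's latin-1 codec maps byte N to code point U+00NN,
# so decoding arbitrary bytes never fails...
text = raw.decode("latin-1")

# ...and encoding the result gives back exactly the original bytes.
round_tripped = text.encode("latin-1")

print(round_tripped == raw)
```

So garbage bytes in a data file would survive a trip through such a dtype unchanged, even if they display as nonsense.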

But there is now latin-9:

https://en.wikipedia.org/wiki/ISO/IEC_8859-15

Maybe a better option?
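For reference, the practical difference is small: latin-9 swaps eight of the latin-1 characters for others, most notably the euro sign. A small sketch using Python's built-in codecs (the variable names are mine):

```python
euro = "\u20ac"   # EURO SIGN
half = "\u00bd"   # VULGAR FRACTION ONE HALF


def can_encode(text, encoding):
    """Return True if `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The euro sign fits in one byte under latin-9 but not under latin-1.
euro_latin9 = euro.encode("iso8859-15")
euro_in_latin1 = can_encode(euro, "latin-1")

# The trade-off: latin-9 gives up a few latin-1 characters, such as 1/2.
half_in_latin9 = can_encode(half, "iso8859-15")
half_in_latin1 = can_encode(half, "latin-1")

# The same byte decodes to a different character under each encoding.
same_byte_as_latin1 = euro_latin9.decode("latin-1")   # CURRENCY SIGN
```

So whichever one we pick, data written with the other will be misread for those eight byte values.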

Encodings we should support are:
- latin1 (1 bytes):
it is compatible with ascii and adds extra characters used in the
western world.
- utf-32 (4 bytes):
can represent every character, equivalent with np.unicode

IIUC, datetime64 is, well, always 64 bits. So it may be better to have a given dtype always be the same bitwidth.

So the utf-32 dtype would be a different dtype, which also keeps it really simple: we have a latin-* dtype and a full-on unicode dtype -- that's it.
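To put numbers on the memory difference between the two, here is a sketch using only Python's codecs as a stand-in for the storage layout (the label is a made-up example, and this is not how numpy itself is implemented):

```python
label = "station_01"   # a made-up 10-character label

# One byte per character.
one_byte = label.encode("latin-1")

# Four bytes per character; the "-le" variant writes no byte-order mark.
four_byte = label.encode("utf-32-le")

print(len(label), len(one_byte), len(four_byte))
```

That is a factor of four for every element, which adds up quickly in a large array of short labels.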

Encodings we should maybe support:
- utf-16 with explicitly disallowing surrogate pairs (2 bytes):
this covers a very large range of possible characters in a reasonably
compact representation

I think UTF-16 is, very simply, the worst of both worlds. If we want a two-byte character set, then it should be UCS-2 -- i.e. explicitly rejecting any code point that takes more than two bytes to represent (or maybe that's what you mean by explicitly disallowing surrogate pairs). In any case, it should certainly give you an encoding error if you try to pass in a unicode character that cannot fit into two bytes.
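Roughly the behaviour I have in mind, as a sketch. Python has no "ucs-2" codec, so `encode_ucs2` is a hypothetical helper that leans on utf-16-le, which produces identical bytes for anything inside the Basic Multilingual Plane:

```python
def encode_ucs2(text):
    """Hypothetical helper: encode to exactly 2 bytes per character.

    Raises UnicodeEncodeError for any code point above U+FFFF, i.e.
    anything UTF-16 would need a surrogate pair to represent.
    """
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise UnicodeEncodeError(
                "ucs-2", text, i, i + 1, "code point outside the BMP"
            )
    # Within the BMP, UTF-16 and UCS-2 encode identically.
    return text.encode("utf-16-le")


ok = encode_ucs2("h\u00e9llo")      # 5 characters, 10 bytes

try:
    encode_ucs2("\U0001F600")       # outside the BMP
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

The point is that every item is always exactly two bytes per character, and anything that does not fit fails loudly rather than silently taking up four.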

So: is there actually a demand for this? If so, then I think it should be a separate 2-byte string type, with the encoding always the same.
 
- utf-8 (4 bytes):
variable length encoding with minimum size of 1 bytes, but we would need
to assume the worst case of 4 bytes so it would not save anything
compared to utf-32 but may allow third parties replace an encoding step
with trailing null trimming on serialization.

yeach -- utf-8 is great for interchange and streaming data, but not for internal storage, particularly with numpy's requirement that every item has the same number of bytes. So if someone wants to work with utf-8, they can store it in a byte array, and encode and decode as they pass it to/from python. That's going to have to happen anyway, even if under the hood. And it's risky business -- if you truncate a utf-8 bytestring, you may get invalid data -- it really does not belong in numpy.
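A small sketch of the truncation problem, in plain Python (the city name is just an example):

```python
name = "Z\u00fcrich"              # "Zurich" with u-umlaut: 6 characters

encoded = name.encode("utf-8")    # 7 bytes: the umlaut takes two
truncated = encoded[:2]           # cut in the middle of the umlaut

try:
    truncated.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# With a one-byte-per-char encoding, truncating at any byte is safe.
latin1_truncated = name.encode("latin-1")[:2].decode("latin-1")
```

A fixed-size item that chops a utf-8 string at its byte limit can end up in exactly this state.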
 
- Add a new dtype, e.g. npy.realstring

I think that's the way to go. Backwards compatibility is really key. Though could we make the existing string dtype an always-latin-1 type without breaking too much? Or maybe deprecate it and get there in the future?

It has the cosmetic disadvantage that it makes the np.unicode dtype
obsolete and is more busywork to implement.

I think the np.unicode type should remain as the 4-bytes per char encoding. But that only makes sense if you follow my idea that we don't have a variable number of bytes per char dtype.

So my proposal is:

 - Create a new one-byte-per-char dtype that is always latin-9 encoded.
    - in python3 it would map to a string (i.e. unicode)
 - Keep the 4-byte per char unicode string type

Optionally (if there is really demand)
 - Create a new two-byte per char dtype that is always UCS-2 encoded.
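As a rough illustration of how a user might choose among those three widths, here is a sketch. `bytes_per_char` is a hypothetical helper, not a proposed numpy API; it just reports the narrowest of the fixed-width options above that can hold a given string:

```python
def bytes_per_char(text):
    """Hypothetical helper: narrowest proposed width that can hold `text`.

    Returns 1 (latin-9), 2 (UCS-2), or 4 (UTF-32).
    """
    try:
        text.encode("iso8859-15")     # latin-9
        return 1
    except UnicodeEncodeError:
        pass
    if all(ord(ch) <= 0xFFFF for ch in text):
        return 2                      # fits in the BMP
    return 4


widths = [
    bytes_per_char("temperature"),    # plain ASCII
    bytes_per_char("\u20ac"),         # euro sign, in latin-9
    bytes_per_char("\u03b1"),         # Greek alpha, not in latin-9
    bytes_per_char("\U0001F600"),     # outside the BMP
]
print(widths)
```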


Is there any way to leverage Python3's nifty string type? I'm thinking not. At least not for numpy arrays that can play well with C code, etc.

All that being said, an encoding-specified string dtype would be nice too -- I just think it's more complex than it needs to be. Numpy is not the tool for text processing...

-CHB



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov