On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted <faltet@gmail.com> wrote:
I remember advocating for UCS-4 adoption in the HDF5 library many years ago (2007?), but I had no success and UTF-8 was decided to be the best candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I don't think there is a go back

This is the key point -- we can argue all we want about the best encoding for fixed-length unicode-supporting strings (I think numpy and HDF have very similar requirements), but that is not our decision to make -- many other systems have chosen utf-8, so it's a really good idea for numpy to be able to deal with that cleanly and easily and consistently.

I have made many anti utf-8 points in this thread because while we need to deal with utf-8 for interplay with other systems, I am very sure that it is not the best format for a default, naive-user-of-numpy unicode-supporting dtype. Nor is it the best encoding for a mostly-ascii compact in memory format.

So I think numpy needs to support at least:

utf-8
latin-1
UCS-4

And it maybe should support one-byte encoding suitable for non-european languages, and maybe utf-16 for Java and Windows compatibility, and ....

So that seems to point to "support as many encodings as possible" And python has the machinery to do so -- so why not? 

(I'm taking Julian's word for it that having a parameterized dtype would not have a major impact on current code)

If we go with a parameterized by encoding string dtype, then we can pick sensible defaults, and let users use what they know best fits their use-cases.

As for python2 -- it is on the way out, I think we should keep the 'U' and 'S' dtypes as they are for backward compatibility and move forward with the new one(s) in a way that is optimized for py3. And it would map to a py2 Unicode type.

The only catch I see in that is what to do with bytes -- we should have a numpy dtype that matches the bytes model -- fixed length bytes that map to python bytes objects. (this is almost what teh void type is yes?) but then under py2, would a bytes object (py2 string) map to numpy 'S' or numpy bytes objects?? 

@Francesc: -- one more question for you:

How important is it for pytables to match the numpy storage to the hdf storage byte for byte? i.e. would it be a killer if encoding / decoding happened every time at the boundary? I'm guessing yes, as this would have been solved long ago if not.

-CHB

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov