
On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted <faltet@gmail.com> wrote:
> I remember advocating for UCS-4 adoption in the HDF5 library many years ago (2007?), but I had no success and UTF-8 was decided to be the best candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I don't think there is a go back

This is the key point -- we can argue all we want about the best encoding for fixed-length unicode-supporting strings (I think numpy and HDF have very similar requirements), but that is not our decision to make -- many other systems have chosen utf-8, so it's a really good idea for numpy to be able to deal with that cleanly, easily, and consistently.

I have made many anti-utf-8 points in this thread because, while we need to deal with utf-8 for interplay with other systems, I am very sure that it is not the best format for a default, naive-user-of-numpy unicode-supporting dtype. Nor is it the best encoding for a mostly-ascii, compact in-memory format.

So I think numpy needs to support at least:

utf-8
latin-1
UCS-4

And it maybe should support a one-byte encoding suitable for non-European languages, and maybe utf-16 for Java and Windows compatibility, and ....

So that seems to point to "support as many encodings as possible". And python has the machinery to do so -- so why not?

(I'm taking Julian's word for it that having a parameterized dtype would not have a major impact on current code.)

If we go with a string dtype parameterized by encoding, then we can pick sensible defaults, and let users use what they know best fits their use-cases.

As for python2 -- it is on the way out. I think we should keep the 'U' and 'S' dtypes as they are for backward compatibility and move forward with the new one(s) in a way that is optimized for py3. And it would map to a py2 Unicode type.

The only catch I see in that is what to do with bytes -- we should have a numpy dtype that matches the bytes model -- fixed-length bytes that map to python bytes objects (this is almost what the void type is, yes?). But then under py2, would a bytes object (py2 string) map to numpy 'S' or numpy bytes objects??

@Francesc: -- one more question for you:

How important is it for pytables to match the numpy storage to the hdf storage byte for byte? i.e.
would it be a killer if encoding / decoding happened every time at the boundary? I'm guessing yes, as this would have been solved long ago if not.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
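
To make the size trade-offs among utf-8, latin-1, and UCS-4 discussed above a bit more concrete, here is a minimal sketch using only Python's built-in codecs. The sample strings and the helper `encoded_size` are hypothetical, chosen purely for illustration; `utf-32-le` stands in for UCS-4 (4 bytes per code point, no byte-order mark). It says nothing about how a numpy dtype would actually be implemented.

```python
# Hypothetical sample strings (illustration only):
# plain ASCII, Latin-1 with one accented character, and CJK text.
samples = ["numpy", "caf\u00e9", "\u65e5\u672c\u8a9e"]

# "utf-32-le" is used as a stand-in for UCS-4: fixed 4 bytes per code point, no BOM.
encodings = ("utf-8", "latin-1", "utf-32-le")


def encoded_size(text, encoding):
    """Return the byte length of `text` in `encoding`, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


for text in samples:
    sizes = {enc: encoded_size(text, enc) for enc in encodings}
    print(repr(text), sizes)
```

For these samples, latin-1 and utf-8 are equally compact on pure ASCII, utf-8 grows by one byte for the accented character, latin-1 cannot represent the CJK string at all, and the UCS-4 stand-in always takes 4 bytes per code point.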