2017-04-27 18:18 GMT+02:00 Chris Barker <chris.barker@noaa.gov>:
On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted <faltet@gmail.com> wrote:
I remember advocating for UCS-4 adoption in the HDF5 library many years ago (2007?), but I had no success and UTF-8 was judged to be the best candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I don't think there is any going back.
This is the key point -- we can argue all we want about the best encoding for fixed-length unicode-supporting strings (I think numpy and HDF have very similar requirements), but that is not our decision to make -- many other systems have chosen utf-8, so it's a really good idea for numpy to be able to deal with that cleanly and easily and consistently.
Agreed. But it would also be a good idea to spread the word that simple UCS4 encoding in combination with compression can be a perfectly good system for storing large amounts of unicode data too.
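To make the UCS-4-plus-compression point concrete, here is a minimal sketch using only the Python standard library. The sample text is made up for illustration, and zlib merely stands in for whatever compressor a storage layer would apply. The resulting sizes depend heavily on the data, so none are quoted here.

```python
import zlib

# Hypothetical sample: mostly-ASCII words with a few non-ASCII characters.
words = ["temperature", "salinity", "Z\u00fcrich", "ni\u00f1o", "depth"] * 2000
text = "\n".join(words)

utf8_raw = text.encode("utf-8")
ucs4_raw = text.encode("utf-32-le")  # 4 bytes per code point, no BOM

utf8_packed = zlib.compress(utf8_raw)
ucs4_packed = zlib.compress(ucs4_raw)

# Compare raw and compressed sizes for both encodings.
print("raw:       ", len(utf8_raw), len(ucs4_raw))
print("compressed:", len(utf8_packed), len(ucs4_packed))
```

The padding bytes in UCS-4 are highly redundant, which is why a general-purpose compressor can remove much of the size penalty while the in-memory layout stays fixed-width.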
I have made many anti-utf-8 points in this thread because, while we need to deal with utf-8 for interplay with other systems, I am very sure that it is not the best format for a default, naive-user-of-numpy unicode-supporting dtype. Nor is it the best encoding for a mostly-ascii compact in-memory format.
I resonate a lot with this feeling too :)
So I think numpy needs to support at least:
utf-8
latin-1
UCS-4
And it maybe should support a one-byte encoding suitable for non-European languages, and maybe utf-16 for Java and Windows compatibility, and ....
So that seems to point to "support as many encodings as possible". And python has the machinery to do so -- so why not?
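As a rough illustration of that machinery, the sketch below uses only `str.encode` from the standard library to show how many bytes the same strings need under several encodings. The helper name `encoded_size` and the sample strings are made up for this example.

```python
# Sample strings written with escapes so the code points are unambiguous.
samples = ["abc", "caf\u00e9", "\u65e5\u672c\u8a9e"]
encodings = ["utf-8", "latin-1", "utf-16-le", "utf-32-le"]


def encoded_size(s, encoding):
    """Return the byte length of s in the given encoding, or None if it cannot be encoded."""
    try:
        return len(s.encode(encoding))
    except UnicodeEncodeError:
        return None


for s in samples:
    print(repr(s), {enc: encoded_size(s, enc) for enc in encodings})
```

The point is only that the byte width per character varies with the encoding, and that some encodings cannot represent some text at all, so no single choice suits every use case.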
(I'm taking Julian's word for it that having a parameterized dtype would not have a major impact on current code)
If we go with a string dtype parameterized by encoding, then we can pick sensible defaults, and let users use what they know best fits their use-cases.
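A pure-Python sketch of what "parameterized by encoding" could look like for a single fixed-width field follows. `FixedText` is a hypothetical class invented for this illustration, not a NumPy API, and it assumes the field width is a multiple of the encoding's code unit size.

```python
class FixedText:
    """Hypothetical fixed-width text field parameterized by encoding.

    Not a NumPy API; only a sketch of the idea.
    """

    def __init__(self, encoding, nbytes):
        self.encoding = encoding
        # Assumption: nbytes is a multiple of the code unit size
        # (1 for utf-8/latin-1, 2 for utf-16-le, 4 for utf-32-le).
        self.nbytes = nbytes

    def pack(self, s):
        """Encode s and pad it with NUL bytes to the field width."""
        data = s.encode(self.encoding)
        if len(data) > self.nbytes:
            raise ValueError(
                f"{s!r} needs {len(data)} bytes, field has {self.nbytes}"
            )
        return data.ljust(self.nbytes, b"\x00")

    def unpack(self, raw):
        """Decode the whole field, then drop the NUL padding.

        Trailing NUL characters in the original string are lost.
        """
        return raw.decode(self.encoding).rstrip("\x00")


utf8_field = FixedText("utf-8", 8)
ucs4_field = FixedText("utf-32-le", 8)

print(utf8_field.pack("caf\u00e9"))
print(ucs4_field.unpack(ucs4_field.pack("ab")))
```

With the same 8-byte field, the utf-8 variant holds up to eight ASCII characters while the UCS-4 variant holds exactly two code points, which is the kind of trade-off a user would pick per use case.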
As for python2 -- it is on the way out, I think we should keep the 'U' and 'S' dtypes as they are for backward compatibility and move forward with the new one(s) in a way that is optimized for py3. And it would map to a py2 Unicode type.
The only catch I see in that is what to do with bytes -- we should have a numpy dtype that matches the bytes model -- fixed-length bytes that map to python bytes objects. (This is almost what the void type is, yes?) But then under py2, would a bytes object (py2 string) map to numpy 'S' or numpy bytes objects?
@Francesc: -- one more question for you:
How important is it for pytables to match the numpy storage to the hdf storage byte for byte? i.e. would it be a killer if encoding / decoding happened every time at the boundary? I'm guessing yes, as this would have been solved long ago if not.
The PyTables team decided some time ago that it was a waste of time and resources to maintain the internal HDF5 interface, and that it would be better to switch to h5py for the low-level I/O communication with HDF5 (btw, we just received a small NumFOCUS grant to continue the ongoing work on this; thanks guys!). This means that PyTables will be basically agnostic about this sort of encoding issue, and that the important package to take into account for interfacing NumPy and HDF5 is just h5py.
--
Francesc Alted