On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <njs@pobox.com> wrote:
It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently.

I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. Is it that they want numpy to enforce the length limit early, to catch errors when the array is modified instead of when they go to write it to the file? Is it that they really want an O(1) way to look at an array and know the maximum number of bytes needed to represent it in utf-8? Is it that utf-8 <-> utf-32 conversion is really annoying and files that need it are rare so they haven't had the motivation to implement it? My impression is similar to Julian's: you *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few dozen lines of code, which is nothing compared to all the other hoops these libraries are already jumping through, so if this is really the roadblock then I must be missing something.
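
For concreteness, here is a rough sketch of what such a conversion between fixed-length UTF-8 and numpy U arrays might look like. It is only an illustration, not code from h5py or PyTables: the helper names u_to_fixed_utf8 and fixed_utf8_to_u are made up, and it assumes that checking the byte limit at write time (an O(n) pass over the array) is acceptable.

```python
import numpy as np


def u_to_fixed_utf8(arr, nbytes):
    """Encode a numpy U array as fixed-width UTF-8 bytes (dtype S<nbytes>).

    Hypothetical helper: the width is counted in bytes, as HDF5 does,
    not in code points as numpy's U dtype does.
    """
    encoded = np.char.encode(arr, "utf-8")
    # np.char.encode sizes the result to the longest encoded element,
    # so itemsize is the maximum number of UTF-8 bytes in the array.
    if encoded.dtype.itemsize > nbytes:
        raise ValueError(
            f"longest element needs {encoded.dtype.itemsize} bytes, "
            f"but the field is only {nbytes} bytes wide"
        )
    # Shorter elements are padded with trailing NUL bytes.
    return encoded.astype(f"S{nbytes}")


def fixed_utf8_to_u(arr):
    """Decode fixed-width UTF-8 bytes back into a numpy U array."""
    # numpy drops the trailing NUL padding of S elements before decoding.
    return np.char.decode(arr, "utf-8")


names = np.array(["héllo", "日本語"])
stored = u_to_fixed_utf8(names, 9)   # 9 bytes is enough for both elements
restored = fixed_utf8_to_u(stored)
```

The error is only raised when converting, which is the "catch it when writing the file" behaviour rather than the "enforce the limit early" one.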

I actually agree with you. I think it's mostly a matter of convenience that h5py matched up HDF5 dtypes with numpy dtypes:
fixed width ASCII -> np.string_/bytes
variable length ASCII -> object arrays of np.string_/bytes
variable length UTF-8 -> object arrays of unicode

This was tenable in a Python 2 world, but on Python 3 it's broken and there's not an easy fix.

We absolutely could fix h5py by mapping everything to object arrays of Python unicode strings, as has been discussed (https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would be an acceptable if non-ideal solution, since there is currently no fixed width UTF-8 support.

For fixed width ASCII arrays, this would mean increased convenience for Python 3 users, at the price of decreased convenience for Python 2 users (arrays now contain boxed Python objects), unless we made the h5py behavior dependent on the version of Python. Hence, we're back here, waiting for better dtypes for encoded strings.

So for HDF5, I see good use cases for ASCII-with-surrogateescape (for handling ASCII arrays as strings) and UTF-8 with length equal to the number of bytes.
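
To illustrate the two behaviours, here is a small sketch in plain Python rather than with a numpy dtype, since no such dtype exists yet. The helper names ascii_roundtrip and utf8_sizes are made up for this example.

```python
def ascii_roundtrip(raw):
    """Decode nominally-ASCII bytes to str and encode them back.

    With errors="surrogateescape", each non-ASCII byte is mapped to a
    lone surrogate code point instead of raising, and encoding with the
    same error handler restores the original byte exactly.
    """
    text = raw.decode("ascii", errors="surrogateescape")
    back = text.encode("ascii", errors="surrogateescape")
    return text, back


def utf8_sizes(s):
    """Return (number of code points, number of UTF-8 bytes) for a str."""
    return len(s), len(s.encode("utf-8"))


# A stray Latin-1 byte in a field that is supposed to be ASCII.
text, back = ascii_roundtrip(b"caf\xe9")

# numpy's U dtype would count the first number, HDF5 the second.
n_codepoints, n_bytes = utf8_sizes("naïve")
```

The first helper shows why surrogateescape is attractive for ASCII arrays: data that is not strictly ASCII survives a read and write unchanged. The second shows the mismatch between a length in code points and a length in bytes.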