On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <njs@pobox.com> wrote:
>
> On Apr 26, 2017 12:09 PM, "Robert Kern" <robert.kern@gmail.com> wrote:

>> It's worthwhile enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently.
>
> I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. Is it that they want numpy to enforce the length limit early, to catch errors when the array is modified instead of when they go to write it to the file? Is it that they really want an O(1) way to look at a array and know the maximum number of bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is really annoying and files that need it are rare so they haven't had the motivation to implement it?

https://github.com/PyTables/PyTables/issues/499

--
Robert Kern