[Numpy-discussion] proposal: smaller representation of string arrays

Wed Apr 26 21:34:41 EDT 2017

On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <njs at pobox.com> wrote:

> It's worthwhile enough that both major HDF5 bindings don't support Unicode
> arrays, despite user requests for years. The sticking point seems to be the
> difference between HDF5's view of a Unicode string array (defined in size
> by the bytes of UTF-8 data) and numpy's current view of a Unicode string
> array (because of UCS-4, defined by the number of
> characters/codepoints/whatever). So there are HDF5 files out there that
> none of our HDF5 bindings can read, and it is impossible to write certain
> data efficiently.
>
>
> I would really like to hear more from the authors of these libraries about
> what exactly it is they feel they're missing. Is it that they want numpy to
> enforce the length limit early, to catch errors when the array is modified
> instead of when they go to write it to the file? Is it that they really
> want an O(1) way to look at an array and know the maximum number of bytes
> needed to represent it in utf-8? Is it that utf8<->utf-32 conversion is
> really annoying and files that need it are rare so they haven't had the
> motivation to implement it? My impression is similar to Julian's: you
> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
> dozen lines of code, which is nothing compared to all the other hoops these
> libraries are already jumping through, so if this is really the roadblock
> then I must be missing something.
>

I actually agree with you. I think it's mostly a matter of convenience that
h5py matched up HDF5 dtypes with numpy dtypes:
fixed width ASCII -> np.string_/bytes
variable length ASCII -> object arrays of np.string_/bytes
variable length UTF-8 -> object arrays of unicode

This was tenable in a Python 2 world, but on Python 3 it's broken and
there's not an easy fix.
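
A minimal sketch of why the bytes-based mapping stops being convenient on Python 3, using only plain Python (no h5py or numpy calls). The field name is a made-up example value. The point is that bytes and str are distinct types on Python 3, so data read back as np.string_/bytes no longer behaves like an ordinary string:

```python
# Hypothetical value, standing in for a fixed-width ASCII field read from a file.
# On Python 3 such data arrives as bytes, not as str.
raw = b"temperature"

# bytes and str never compare equal on Python 3, so code written
# with ordinary string literals quietly stops matching.
matches_literal = (raw == "temperature")

# An explicit decode step is required to get a text string back.
text = raw.decode("ascii")
matches_after_decode = (text == "temperature")

print(matches_literal, matches_after_decode)
```

On Python 2 the first comparison would have succeeded, because str was itself a byte string.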

We absolutely could fix h5py by mapping everything to object arrays of
Python unicode strings, as has been discussed (
https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would
be a fine but non-ideal solution, since there is currently no fixed width
UTF-8 support.

For fixed width ASCII arrays, this would mean increased convenience for
Python 3 users, at the price of decreased convenience for Python 2 users
(arrays now contain boxed Python objects), unless we made the h5py behavior
dependent on the version of Python. Hence, we're back here, waiting for
better dtypes for encoded strings.

So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
handling ASCII arrays as strings) and UTF-8 with length equal to the number
of bytes.
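
The two use cases can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a proposed numpy or h5py API: the function names are hypothetical, and null padding of fixed-width fields is assumed here (HDF5 also allows other padding conventions).

```python
def ascii_field_to_str(raw: bytes) -> str:
    """Decode a nominally ASCII field to str.

    Bytes outside the ASCII range are mapped to lone surrogates by the
    surrogateescape error handler instead of raising an error.
    """
    return raw.decode("ascii", errors="surrogateescape")


def str_to_ascii_field(text: str) -> bytes:
    """Inverse of ascii_field_to_str; escaped surrogates become the original bytes."""
    return text.encode("ascii", errors="surrogateescape")


def pack_utf8_field(text: str, width: int) -> bytes:
    """Encode text as UTF-8 into a field of exactly `width` bytes.

    The limit is on encoded bytes, not on code points. Null padding is
    an assumption made for this sketch.
    """
    data = text.encode("utf-8")
    if len(data) > width:
        raise ValueError(
            "%r needs %d bytes, but the field holds %d" % (text, len(data), width)
        )
    return data.ljust(width, b"\x00")


def unpack_utf8_field(raw: bytes) -> str:
    """Strip the assumed null padding and decode the UTF-8 payload."""
    return raw.rstrip(b"\x00").decode("utf-8")


# A field that claims to be ASCII but holds a stray Latin-1 byte survives
# the round trip unchanged.
dirty = b"caf\xe9"
assert str_to_ascii_field(ascii_field_to_str(dirty)) == dirty

# "naïve" has 5 code points but needs 6 bytes in UTF-8.
assert len("naïve") == 5
assert len(pack_utf8_field("naïve", 6)) == 6
assert unpack_utf8_field(pack_utf8_field("naïve", 8)) == "naïve"
```

With a byte-defined width, the same field can hold six ASCII characters or three two-byte characters, which is the mismatch with numpy's code-point-defined U dtype described in the quoted text.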