[Numpy-discussion] proposal: smaller representation of string arrays

Thu Apr 27 07:10:42 EDT 2017

2017-04-27 3:34 GMT+02:00 Stephan Hoyer <shoyer at gmail.com>:

> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <njs at pobox.com> wrote:
>
>> It's worthwhile enough that both major HDF5 bindings don't support
>> Unicode arrays, despite user requests for years. The sticking point seems
>> to be the difference between HDF5's view of a Unicode string array (defined
>> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode
>> string array (because of UCS-4, defined by the number of
>> characters/codepoints/whatever). So there are HDF5 files out there that
>> none of our HDF5 bindings can read, and it is impossible to write certain
>> data efficiently.
>>
>>
>> I would really like to hear more from the authors of these libraries
>> about what exactly it is they feel they're missing. Is it that they want
>> numpy to enforce the length limit early, to catch errors when the array is
>> modified instead of when they go to write it to the file? Is it that they
>> really want an O(1) way to look at an array and know the maximum number of
>> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion
>> is really annoying and files that need it are rare so they haven't had the
>> motivation to implement it? My impression is similar to Julian's: you
>> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
>> dozen lines of code, which is nothing compared to all the other hoops these
>> libraries are already jumping through, so if this is really the roadblock
>> then I must be missing something.
>>
>
> I actually agree with you. I think it's mostly a matter of convenience
> that h5py matched up HDF5 dtypes with numpy dtypes:
> fixed width ASCII -> np.string_/bytes
> variable length ASCII -> object arrays of np.string_/bytes
> variable length UTF-8 -> object arrays of unicode
>
> This was tenable in a Python 2 world, but on Python 3 it's broken and
> there's not an easy fix.
>
> We absolutely could fix h5py by mapping everything to object arrays of
> Python unicode strings, as has been discussed (
> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would
> be a fine but non-ideal solution, since there is currently no fixed width
> UTF-8 support.
>
> For fixed width ASCII arrays, this would mean increased convenience for
> Python 3 users, at the price of decreased convenience for Python 2 users
> (arrays now contain boxed Python objects), unless we made the h5py behavior
> dependent on the version of Python. Hence, we're back here, waiting for
> better dtypes for encoded strings.
>
> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
> handling ASCII arrays as strings) and UTF-8 with length equal to the number
> of bytes.
>

Well, I'll say upfront that I have not read this discussion in full, but
apparently some opinions from developers of HDF5 Python packages would be
welcome here, so here I go :)

As a long-time developer of one of the Python HDF5 packages (PyTables), I
have always been of the opinion that plain ASCII (for byte strings) and
UCS-4 (for Unicode) encodings would be the appropriate dtypes for storing
large amounts of data, especially for disk storage (but also for
compressed in-memory containers).  My rationale is that, although UCS-4 may
require way too much space, compression would reduce that to basically the
space that is required by compressed UTF-8 (I won't go into detail, but
basically this is possible by using the shuffle filter).
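
For readers who have not met the shuffle filter, here is a rough sketch of
the idea, not how PyTables or HDF5 implement it.  It assumes little-endian
UCS-4 and shuffles per 4-byte code point (real filters shuffle by the
dataset's element size), uses zlib as a stand-in compressor, and the word
list and the byte_shuffle helper are made up for illustration.  The point
is only that the three mostly-zero bytes of each code point end up grouped
together, where a compressor can squeeze them easily:

```python
import zlib
import numpy as np


def byte_shuffle(raw: bytes, itemsize: int) -> bytes:
    """Regroup bytes so byte 0 of every item comes first, then byte 1, etc."""
    planes = np.frombuffer(raw, dtype=np.uint8).reshape(-1, itemsize)
    return planes.T.tobytes()


# Hypothetical sample data: ASCII-only words, repeated to get a larger buffer.
words = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"] * 500

ucs4 = np.array(words, dtype="<U8")                  # 4 bytes per code point
utf8 = np.array([w.encode("utf-8") for w in words])  # fixed-width bytes (S7)

sizes = {
    "ucs4_raw": ucs4.nbytes,
    "ucs4_zlib": len(zlib.compress(ucs4.tobytes())),
    "ucs4_shuffle_zlib": len(zlib.compress(byte_shuffle(ucs4.tobytes(), 4))),
    "utf8_zlib": len(zlib.compress(utf8.tobytes())),
}

for name, size in sizes.items():
    print(f"{name:>18}: {size} bytes")
```

The actual numbers depend heavily on the data and the compressor, so take
this as a way to experiment rather than as a benchmark.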

I remember advocating for UCS-4 adoption in the HDF5 library many years ago
(2007?), but I had no success and UTF-8 was decided to be the best
candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I
don't think there is any going back (not even by adding UCS-4 support to
it, although I continue to think it would be a good idea).  So, I suppose
that if HDF5 is found to be an important format for NumPy users (and I
think this is the case), a solution for representing Unicode characters by
using UTF-8 in NumPy would be desirable (at the risk of making the
implementation more complex).
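
To make the size mismatch discussed in this thread concrete, here is a small
sketch using only existing NumPy functions (np.char.encode, np.char.decode,
np.char.str_len).  The fits_utf8_width helper and the sample strings are
hypothetical, just for illustration, and this is not a proposal for how a
UTF-8 dtype should work.  A 'U3' array is sized by code points, while a
fixed-width UTF-8 field would be sized by bytes:

```python
import numpy as np

# Three strings of at most 3 code points: "abc", "e-acute", and 3 CJK characters.
names = np.array(["abc", "\u00e9", "\u65e5\u672c\u8a9e"], dtype="U3")

# Encoding yields a fixed-width bytes array sized to the longest UTF-8 item.
encoded = np.char.encode(names, "utf-8")
byte_lengths = np.char.str_len(encoded)

print(names.dtype.itemsize)    # UCS-4 storage per element, in bytes
print(encoded.dtype.itemsize)  # UTF-8 storage per element, in bytes
print(byte_lengths.tolist())   # UTF-8 length of each string, in bytes


def fits_utf8_width(arr, nbytes):
    """Hypothetical check: would every string fit a UTF-8 field of nbytes?"""
    return bool((np.char.str_len(np.char.encode(arr, "utf-8")) <= nbytes).all())


# Decoding gets back the original code-point-based array.
decoded = np.char.decode(encoded, "utf-8")
```

Note that the check has to touch every element, so there is no O(1) answer
to "how many UTF-8 bytes does this U array need?" with the current dtypes.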

Francesc

-- 
Francesc Alted