[Numpy-discussion] proposal: smaller representation of string arrays

Thu Apr 27 07:27:31 EDT 2017

So while compression+ucs-4 might be OK for out-of-core representation, what
about in-core?  blosc+ucs-4?  I don't think that works for mmap, does it?

On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted <faltet at gmail.com> wrote:

> 2017-04-27 3:34 GMT+02:00 Stephan Hoyer <shoyer at gmail.com>:
>
>> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>>> It's worthwhile enough that both major HDF5 bindings don't support
>>> Unicode arrays, despite user requests for years. The sticking point seems
>>> to be the difference between HDF5's view of a Unicode string array (defined
>>> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode
>>> string array (because of UCS-4, defined by the number of
>>> characters/codepoints/whatever). So there are HDF5 files out there that
>>> none of our HDF5 bindings can read, and it is impossible to write certain
>>> data efficiently.
>>>
>>>
>>> I would really like to hear more from the authors of these libraries
>>> about what exactly it is they feel they're missing. Is it that they want
>>> numpy to enforce the length limit early, to catch errors when the array is
>>> modified instead of when they go to write it to the file? Is it that they
>>> really want an O(1) way to look at a array and know the maximum number of
>>> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion
>>> is really annoying and files that need it are rare so they haven't had the
>>> motivation to implement it? My impression is similar to Julian's: you
>>> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
>>> dozen lines of code, which is nothing compared to all the other hoops these
>>> libraries are already jumping through, so if this is really the roadblock
>>> then I must be missing something.
>>>
>>
>> I actually agree with you. I think it's mostly a matter of convenience
>> that h5py matched up HDF5 dtypes with numpy dtypes:
>> fixed width ASCII -> np.string_/bytes
>> variable length ASCII -> object arrays of np.string_/bytes
>> variable length UTF-8 -> object arrays of unicode
>>
>> This was tenable in a Python 2 world, but on Python 3 it's broken and
>> there's not an easy fix.
>>
>> We absolutely could fix h5py by mapping everything to object arrays of
>> Python unicode strings, as has been discussed (
>> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this
>> would be a fine but non-ideal solution, since there is currently no fixed
>> width UTF-8 support.
>>
>> For fixed width ASCII arrays, this would mean increased convenience for
>> Python 3 users, at the price of decreased convenience for Python 2 users
>> (arrays now contain boxed Python objects), unless we made the h5py behavior
>> dependent on the version of Python. Hence, we're back here, waiting for
>> better dtypes for encoded strings.
>>
>> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
>> handling ASCII arrays as strings) and UTF-8 with length equal to the number
>> of bytes.
>>
>
> Well, I'll say upfront that I have not read this discussion in the fully,
> but apparently some opinions from developers of HDF5 Python packages would
> be welcome here, so here I go :) 
>
> As a long-time developer of one of the Python HDF5 packages (PyTables), I
> have always been of the opinion that plain ASCII (for byte strings) and
> UCS-4 (for Unicode) encoding would be the appropriate dtypes for storing
> large amounts of data, most specially for disk storage (but also using
> compressed in-memory containers).  My rational is that, although UCS-4 may
> require way too much space, compression would reduce that to basically the
> space that is required by compressed UTF-8 (I won't go into detail, but
> basically this is possible by using the shuffle filter).
>
> I remember advocating for UCS-4 adoption in the HDF5 library many years
> ago (2007?), but I had no success and UTF-8 was decided to be the best
> candidate.  So, the boat with HDF5 using UTF-8 sailed many years ago, and I
> don't think there is a go back (not even adding UCS-4 support on it,
> although I continue to think it would be a good idea).  So, I suppose that
> if HDF5 is found to be an important format for NumPy users (and I think
> this is the case), a solution for representing Unicode characters by using
> UTF-8 in NumPy would be desirable (at the risk of making the implementation
> more complex).
>
> Francesc
> 
>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at python.org
>> https://mail.python.org/mailman/listinfo/numpy-discussion
>>
>>
>
>
> --
> Francesc Alted
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20170427/95ea4dfe/attachment-0001.html>