[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Wed Apr 26 20:08:30 EDT 2017


On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <njs at pobox.com> wrote:
>
> On Apr 26, 2017 12:09 PM, "Robert Kern" <robert.kern at gmail.com> wrote:

>> It's worthwhile enough that both major HDF5 bindings don't support
>> Unicode arrays, despite user requests for years. The sticking point seems
>> to be the difference between HDF5's view of a Unicode string array (defined
>> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode
>> string array (because of UCS-4, defined by the number of
>> characters/codepoints/whatever). So there are HDF5 files out there that
>> none of our HDF5 bindings can read, and it is impossible to write certain
>> data efficiently.
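
To make the mismatch concrete, here is a rough sketch in plain numpy (the
example strings are made up; nothing HDF5-specific is involved):

    import numpy as np

    # numpy's 'U' dtype is fixed-width UCS-4: 4 bytes per codepoint.
    a = np.array(["café", "naïve"], dtype="U5")
    print(a.dtype.itemsize)   # 20 bytes per element: 5 codepoints * 4 bytes

    # HDF5 sizes a fixed-length UTF-8 string field by its encoded byte
    # length, which is a different number than the codepoint count.
    for s in a:
        print(len(s), len(s.encode("utf-8")))   # e.g. 4 codepoints -> 5 bytes

The itemsize tells you the UCS-4 storage, but not how wide the corresponding
UTF-8 field in the file needs to be.
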
>
> I would really like to hear more from the authors of these libraries
> about what exactly it is they feel they're missing. Is it that they want
> numpy to enforce the length limit early, to catch errors when the array is
> modified instead of when they go to write it to the file? Is it that they
> really want an O(1) way to look at an array and know the maximum number of
> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion
> is really annoying and files that need it are rare, so they haven't had the
> motivation to implement it?
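
On the O(1) question specifically, a rough sketch of the gap (the array
contents are just made-up examples): the dtype alone only gives you the
4-bytes-per-codepoint worst case, and the exact UTF-8 size takes a pass over
every element.

    import numpy as np

    a = np.array(["hello", "héllo", "日本語"], dtype="U5")

    # O(1) bound from the dtype alone: the itemsize is already the UTF-8
    # worst case, since UTF-8 never needs more than 4 bytes per codepoint.
    worst_case = a.dtype.itemsize            # 20 bytes per element

    # The exact maximum requires encoding every element (O(n)).
    exact_max = max(len(s.encode("utf-8")) for s in a)
    print(worst_case, exact_max)             # 20 vs 9 ("日本語" is 9 bytes)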

https://github.com/PyTables/PyTables/issues/499
https://github.com/h5py/h5py/issues/379

--
Robert Kern

