2017-04-27 13:27 GMT+02:00 Neal Becker <ndbecker2@gmail.com>:
So while compression + UCS-4 might be OK for an out-of-core representation, what about in-core? Blosc + UCS-4? I don't think that works for mmap, does it?
Correct, the real problem is mmap for an out-of-core HDF5 representation, I presume. For in-memory, there are several compressed data containers, like:

- https://github.com/alimanfoo/zarr (meant mainly for multidimensional data containers)
- https://github.com/Blosc/bcolz (meant mainly for tabular data containers)

(there might be others).
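For instance, here is a minimal sketch with zarr (assuming zarr 2.x with its default Blosc compressor; the array contents are just illustrative):

    import numpy as np
    import zarr

    # A fixed-width UCS-4 numpy string array, held compressed in memory.
    data = np.array(['hello', 'mundo'] * 100000, dtype='U5')
    z = zarr.array(data, chunks=10000)       # Blosc-compressed chunks by default
    print(z.nbytes, '->', z.nbytes_stored)   # compression absorbs the UCS-4 padding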
On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted <faltet@gmail.com> wrote:
2017-04-27 3:34 GMT+02:00 Stephan Hoyer <shoyer@gmail.com>:
On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <njs@pobox.com> wrote:
It's hard enough that both major HDF5 bindings don't support Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of its UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/code points/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently.
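A small illustration of the two size measures:

    import numpy as np

    s = 'héllo'                        # 5 code points
    print(len(s.encode('utf-8')))      # 6 bytes of UTF-8 (HDF5's measure)
    print(np.array([s]).dtype)         # <U5: numpy sizes in code points...
    print(np.array([s]).nbytes)        # ...so 5 * 4 = 20 bytes in memory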
I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. Is it that they want numpy to enforce the length limit early, to catch errors when the array is modified instead of when they go to write it to the file? Is it that they really want an O(1) way to look at an array and know the maximum number of bytes needed to represent it in UTF-8? Is it that UTF-8 <-> UTF-32 conversion is really annoying, and files that need it are rare enough that they haven't had the motivation to implement it? My impression is similar to Julian's: you *could* implement HDF5 fixed-length UTF-8 <-> numpy 'U' arrays with a few dozen lines of code, which is nothing compared to all the other hoops these libraries are already jumping through, so if this is really the roadblock then I must be missing something.
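For reference, a rough sketch of what such a conversion could look like (the function names and NUL-padding convention here are illustrative, not h5py's actual code):

    import numpy as np

    def utf8_fixed_to_ucs4(raw):
        """Decode fixed-width, NUL-padded UTF-8 byte strings (as HDF5
        stores them) into a numpy 'U' array."""
        decoded = [b.rstrip(b'\x00').decode('utf-8') for b in raw]
        width = max(len(s) for s in decoded) or 1
        return np.array(decoded, dtype='U%d' % width)

    def ucs4_to_utf8_fixed(arr):
        """Encode a numpy 'U' array as fixed-width UTF-8 byte strings,
        NUL-padded to the longest encoded element."""
        encoded = [s.encode('utf-8') for s in arr.tolist()]
        width = max(len(b) for b in encoded) or 1
        return np.array([b.ljust(width, b'\x00') for b in encoded],
                        dtype='S%d' % width)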
I actually agree with you. I think it's mostly a matter of convenience that h5py matched up HDF5 dtypes with numpy dtypes:

- fixed-width ASCII -> np.string_/bytes
- variable-length ASCII -> object arrays of np.string_/bytes
- variable-length UTF-8 -> object arrays of unicode
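In code, that mapping looks roughly like this (a sketch against h5py 2.x; the file and dataset names are made up):

    import numpy as np
    import h5py

    with h5py.File('example.h5', 'w') as f:
        # fixed-width ASCII -> np.string_/bytes
        f['fixed'] = np.array([b'abc', b'de'], dtype='S3')
        # variable-length ASCII -> object arrays of bytes
        vlen_bytes = h5py.special_dtype(vlen=bytes)
        f.create_dataset('vlen', data=np.array([b'x', b'yz'], dtype=object),
                         dtype=vlen_bytes)
        print(f['fixed'][:])   # comes back as bytes, on Python 3 too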
This was tenable in a Python 2 world, but on Python 3 it's broken and there's not an easy fix.
We absolutely could fix h5py by mapping everything to object arrays of Python unicode strings, as has been discussed (https://github.com/h5py/h5py/pull/871). For fixed-width UTF-8, this would be a workable but non-ideal solution, since NumPy currently has no fixed-width UTF-8 dtype.
For fixed-width ASCII arrays, this would mean increased convenience for Python 3 users, at the price of decreased convenience for Python 2 users (arrays would now contain boxed Python objects), unless we made h5py's behavior depend on the Python version. Hence, we're back here, waiting for better dtypes for encoded strings.
So for HDF5, I see good use cases for ASCII-with-surrogateescape (for handling ASCII arrays as strings) and UTF-8 with length equal to the number of bytes.
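For context, the surrogateescape error handler is what lets arbitrary bytes round-trip through a Python str without loss:

    raw = b'caf\xe9'                             # not valid ASCII (or UTF-8)
    s = raw.decode('ascii', 'surrogateescape')   # lone surrogate '\udce9'
    assert s.encode('ascii', 'surrogateescape') == raw   # lossless round trip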
Well, I'll say upfront that I have not read this discussion in full, but apparently some opinions from developers of HDF5 Python packages would be welcome here, so here I go :)
As a long-time developer of one of the Python HDF5 packages (PyTables), I have always been of the opinion that plain ASCII (for byte strings) and UCS-4 (for Unicode) would be the appropriate dtypes for storing large amounts of data, especially for disk storage (but also for compressed in-memory containers). My rationale is that, although UCS-4 may require far too much space, compression would reduce that to essentially the space required by compressed UTF-8 (I won't go into detail, but this is possible by using the shuffle filter).
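A minimal sketch of that effect with the python-blosc bindings (the sample data and codec choice are arbitrary):

    import numpy as np
    import blosc

    words = np.array(['numpy', 'unicode', 'strings'] * 50000, dtype='U7')
    raw = words.tobytes()
    # typesize=4 makes the shuffle filter group the mostly-zero high
    # bytes of each UCS-4 code unit together, so they compress away.
    packed = blosc.compress(raw, typesize=4, shuffle=blosc.SHUFFLE, cname='lz4')
    print(len(raw), '->', len(packed))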
I remember advocating for UCS-4 adoption in the HDF5 library many years ago (2007?), but I had no success and UTF-8 was decided to be the best candidate. So the boat with HDF5 using UTF-8 sailed many years ago, and I don't think there is any going back (not even adding UCS-4 support to it, although I continue to think that would be a good idea). So I suppose that if HDF5 is an important format for NumPy users (and I think it is), a way of representing Unicode characters as UTF-8 in NumPy would be desirable, even at the risk of making the implementation more complex.
Francesc
-- Francesc Alted