On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <njs@pobox.com> wrote:
But also, is it important whether strings we're loading/saving to an HDF5 file have the same in-memory representation in numpy as they would in the file? I *know* [1] no-one is reading HDF5 files using np.memmap :-).
Of course they do :) https://github.com/jjhelmus/pyfive/blob/98d26aaddd6a7d83cfb189c113e172cc1b60...
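To make the point concrete, here's a toy sketch (not pyfive's actual code -- the file name, offset, and element size are made up): if a fixed-length string dataset happens to be stored contiguously in the file, its raw bytes can be viewed as an 'S'-dtype array without any copy:

    import numpy as np

    # View 100 fixed-length 16-byte strings starting at a known file offset.
    # NumPy's 'S16' is null-padded, so this only lines up if the file uses
    # the same padding convention.
    raw = np.memmap("data.h5", dtype="S16", mode="r", offset=2048, shape=(100,))
    print(raw[0])  # first 16-byte string, read straight from the mapped file

That's exactly the situation where the in-file and in-memory representations need to match byte-for-byte.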
Also, further searching suggests that HDF5 actually supports all of nul termination, nul padding, and space padding, and that nul termination is the default? How much does it help to have in-memory compatibility with just one of these options (and not even the default one)? Would we need to add the other options to be really useful for HDF5?
h5py actually ignores this padding option and only uses null termination. I have not heard any complaints about this (though I have heard complaints about the lack of fixed-length UTF-8). But more generally, you're right: h5py doesn't need a corresponding NumPy dtype for every HDF5 string dtype, though having one would certainly be *convenient*. In fact, it already (ab)uses NumPy's dtype metadata via h5py.special_dtype to indicate a homogeneous string type for object arrays. I would guess h5py users have the same needs for efficient string representations (including surrogate-escape options) as other scientific users.
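For reference, a minimal sketch of what that metadata trick looks like in practice (the file and dataset names are just illustrative):

    import h5py

    # An object dtype carrying h5py's vlen-string metadata; NumPy itself
    # just sees a plain dtype('O').
    str_dt = h5py.special_dtype(vlen=str)
    print(str_dt)                         # object
    print(h5py.check_dtype(vlen=str_dt))  # <class 'str'>

    # The metadata is what tells h5py to map the object array onto an HDF5
    # variable-length string dataset.
    with h5py.File("example.h5", "w") as f:
        ds = f.create_dataset("names", shape=(3,), dtype=str_dt)
        ds[0] = "variable-length"

So the string-ness lives entirely in dtype metadata rather than in a real NumPy string dtype, which is workable but not exactly elegant.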