On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <njs@pobox.com> wrote:

> But also, is it important whether strings we're loading/saving to an
> HDF5 file have the same in-memory representation in numpy as they
> would in the file? I *know* [1] no-one is reading HDF5 files using
> np.memmap :-).

Of course they do :)

https://github.com/jjhelmus/pyfive/blob/98d26aaddd6a7d83cfb189c113e172cc1b60d5f8/pyfive/low_level.py#L682

> Also, further searching suggests that HDF5 actually supports all of
> nul termination, nul padding, and space padding, and that nul
> termination is the default? How much does it help to have in-memory
> compatibility with just one of these options (and not even the default
> one)? Would we need to add the other options to be really useful for
> HDF5?

h5py actually ignores this option and only uses null termination. I have not heard any complaints about this (though I have heard complaints about the lack of fixed-length UTF-8).
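
For reference, here is a rough sketch of how the three padding conventions differ for a fixed-width field. The helper names `encode_fixed` and `decode_fixed` and the string labels are hypothetical, made up for illustration; they are not HDF5 or h5py APIs, and the bytes written after the terminator in the nul-terminated case are an arbitrary choice.

```python
# Illustrative only: these helpers are hypothetical, not part of HDF5 or h5py.

def encode_fixed(text: bytes, size: int, padding: str) -> bytes:
    """Store `text` in a fixed-width field of `size` bytes."""
    if padding == "nullterm":
        # One byte is reserved for the terminating NUL, so at most
        # size - 1 bytes of payload fit. The remainder is filled with
        # NULs here, but readers only look up to the first NUL.
        if len(text) > size - 1:
            raise ValueError("text too long for a nul-terminated field")
        return text + b"\x00" * (size - len(text))
    if padding == "nullpad":
        fill = b"\x00"
    elif padding == "spacepad":
        fill = b" "
    else:
        raise ValueError(f"unknown padding: {padding!r}")
    # Padded fields may use all `size` bytes for payload.
    if len(text) > size:
        raise ValueError("text too long for the field")
    return text.ljust(size, fill)


def decode_fixed(raw: bytes, padding: str) -> bytes:
    """Recover the payload from a fixed-width field."""
    if padding == "nullterm":
        # Everything after the first NUL is ignored.
        return raw.split(b"\x00", 1)[0]
    if padding == "nullpad":
        # Only trailing NULs are padding; embedded NULs survive.
        return raw.rstrip(b"\x00")
    if padding == "spacepad":
        # Trailing spaces are padding, so they cannot be round-tripped.
        return raw.rstrip(b" ")
    raise ValueError(f"unknown padding: {padding!r}")
```

The nul-padded form is the one that lines up with NumPy's existing `S` dtype, which also drops trailing nuls when an element is read back.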

But more generally, you're right. h5py doesn't need a corresponding NumPy dtype for each HDF5 string dtype, though that would certainly be convenient. In fact, it already (ab)uses NumPy's dtype metadata with h5py.special_dtype to indicate a homogeneous string type for object arrays.
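
To make the metadata trick concrete, here is a minimal sketch using only NumPy. The `"vlen"` key is my assumption about what `h5py.special_dtype(vlen=str)` stores, and `tagged_string_type` is a hypothetical helper written for this example, not an h5py function.

```python
import numpy as np

# A plain object dtype carrying a tag in its metadata. The key name
# "vlen" is an assumption modelled on h5py.special_dtype(vlen=str).
string_dt = np.dtype(object, metadata={"vlen": str})


def tagged_string_type(dt):
    """Return the tagged element type, or None if the dtype is untagged."""
    if dt.kind == "O" and dt.metadata is not None:
        return dt.metadata.get("vlen")
    return None
```

The dtype still behaves as an ordinary object dtype; the tag is just a side channel that a library can inspect when deciding how to write the array.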

I would guess h5py users have the same needs for efficient string representations (including surrogate-escape options) as other scientific users.