On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <njs@pobox.com> wrote:
But also, is it important whether strings we're loading/saving to an HDF5 file have the same in-memory representation in numpy as they would in the file? I *know* [1] no-one is reading HDF5 files using np.memmap :-).
Of course they do :) https://github.com/jjhelmus/pyfive/blob/98d26aaddd6a7d83cfb189c113e172cc1b60...
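To make the point concrete, here's a toy sketch (not pyfive's actual code -- the file name, offset, and element size are made up): if a fixed-length string dataset happens to be stored contiguously in the file, its raw bytes can be viewed as an 'S'-dtype array without any copy:

    import numpy as np

    # View 100 fixed-length 16-byte strings starting at a known file offset.
    # NumPy's 'S16' is null-padded, so this only lines up if the file uses
    # the same padding convention.
    raw = np.memmap("data.h5", dtype="S16", mode="r", offset=2048, shape=(100,))
    print(raw[0])  # first 16-byte string, read straight from the mapped file

That's exactly the situation where the in-file and in-memory representations need to match byte-for-byte.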
Also, further searching suggests that HDF5 actually supports all of nul termination, nul padding, and space padding, and that nul termination is the default? How much does it help to have in-memory compatibility with just one of these options (and not even the default one)? Would we need to add the other options to be really useful for HDF5?
h5py actually ignores this padding option and only uses null termination. I have not heard any complaints about this (though I have heard complaints about the lack of fixed-length UTF-8). But more generally, you're right: h5py doesn't need a corresponding NumPy dtype for every HDF5 string dtype, though having one would certainly be *convenient*. In fact, it already (ab)uses NumPy's dtype metadata via h5py.special_dtype to indicate a homogeneous string type for object arrays. I would guess h5py users have the same needs for efficient string representations (including surrogate-escape options) as other scientific users.
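For reference, a minimal sketch of what that metadata trick looks like in practice (the file and dataset names are just illustrative):

    import h5py

    # An object dtype carrying h5py's vlen-string metadata; NumPy itself
    # just sees a plain dtype('O').
    str_dt = h5py.special_dtype(vlen=str)
    print(str_dt)                         # object
    print(h5py.check_dtype(vlen=str_dt))  # <class 'str'>

    # The metadata is what tells h5py to map the object array onto an HDF5
    # variable-length string dataset.
    with h5py.File("example.h5", "w") as f:
        ds = f.create_dataset("names", shape=(3,), dtype=str_dt)
        ds[0] = "variable-length"

So the string-ness lives entirely in dtype metadata rather than in a real NumPy string dtype, which is workable but not exactly elegant.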