[Numpy-discussion] proposal: smaller representation of string arrays
shoyer at gmail.com
Mon Apr 24 23:01:48 EDT 2017
On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <njs at pobox.com> wrote:
> But also, is it important whether strings we're loading/saving to an
> HDF5 file have the same in-memory representation in numpy as they
> would in the file? I *know*  no-one is reading HDF5 files using
> np.memmap :-).
Of course they do :)
> Also, further searching suggests that HDF5 actually supports all of
> nul termination, nul padding, and space padding, and that nul
> termination is the default? How much does it help to have in-memory
> compatibility with just one of these options (and not even the default
> one)? Would we need to add the other options to be really useful for
h5py actually ignores this option and only uses null termination. I have
not heard any complaints about this (though I have heard complaints about
the lack of fixed-length UTF-8).
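
For anyone who has not looked at the three conventions side by side, here is a rough pure-Python sketch of how a fixed-width field would be filled and read back under each one. The function names and the mode strings ("nullterm", "nullpad", "spacepad") are made up for illustration; they are not h5py or HDF5 API. I am also assuming that a null-terminated field reserves one byte for the terminator.

```python
def encode_fixed(value: bytes, width: int, mode: str) -> bytes:
    """Store `value` in a field of exactly `width` bytes (illustrative only)."""
    if mode == "nullterm":
        # Assumption: one byte is always reserved for the terminating NUL.
        capacity, pad = width - 1, b"\x00"
    elif mode == "nullpad":
        capacity, pad = width, b"\x00"
    elif mode == "spacepad":
        capacity, pad = width, b" "
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if len(value) > capacity:
        raise ValueError("value does not fit in the field")
    return value.ljust(width, pad)


def decode_fixed(field: bytes, mode: str) -> bytes:
    """Recover the stored value from a fixed-width field."""
    if mode == "nullterm":
        # Everything after the first NUL is ignored.
        return field.split(b"\x00", 1)[0]
    if mode == "nullpad":
        return field.rstrip(b"\x00")
    if mode == "spacepad":
        return field.rstrip(b" ")
    raise ValueError(f"unknown mode: {mode!r}")
```

Note that the padded variants cannot round-trip values that themselves end in the padding byte, which is the same limitation NumPy's existing `S` dtype has with trailing NULs.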
But more generally, you're right. h5py doesn't need a corresponding NumPy
dtype for each HDF5 string dtype, though that would certainly be
*convenient*. In fact, it already (ab)uses NumPy's dtype metadata with
h5py.special_dtype to indicate a homogeneous string type for object arrays.
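
To make the metadata trick concrete, here is a minimal sketch using only NumPy. I am assuming the metadata key is "vlen", which is my understanding of what h5py.special_dtype(vlen=str) attaches; the helper name tagged_string_type is hypothetical and not part of either library.

```python
import numpy as np

# An object dtype tagged with metadata. The "vlen" key is an assumption
# about what h5py stores; NumPy itself attaches no meaning to it.
str_object_dtype = np.dtype(object, metadata={"vlen": str})


def tagged_string_type(dt: np.dtype):
    """Return the type stored under the assumed 'vlen' key, or None."""
    if dt.kind != "O" or dt.metadata is None:
        return None
    return dt.metadata.get("vlen")
```

A plain object dtype carries no metadata, so the helper returns None for it, which is why this counts as (ab)use: nothing in NumPy enforces that the array elements really are strings.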
I would guess h5py users have the same needs for efficient string
representations (including surrogate-escape options) as other scientific