[Numpy-discussion] proposal: smaller representation of string arrays

Mon Apr 24 23:07:33 EDT 2017

On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <njs at pobox.com> wrote:
>
> On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern <robert.kern at gmail.com> wrote:
> > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >
> >> That said, AFAICT what people actually want in most use cases is support
> >> for arrays that can hold variable-length strings, and the only place where
> >> the current approach is *optimal* is when we need mmap compatibility with
> >> legacy formats that use fixed-width-nul-padded fields (at which point it's
> >> super convenient). It's not even possible to *represent* all Python strings
> >> or bytestrings in current numpy unicode or string arrays (Python
> >> strings/bytestrings can have trailing nuls). So if we're talking about
> >> tweaks to the current system it probably makes sense to focus on this use
> >> case specifically.
> >>
> >> From context I'm assuming FITS files use fixed-width-nul-padding for
> >> strings? Is that right? I know HDF5 doesn't.
> >
> > Yes, HDF5 does. Or at least, it is supported in addition to the
> > variable-length ones.
> >
> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>
> Doh, I found that page but it was (and is) meaningless to me, so I
> went by http://docs.h5py.org/en/latest/strings.html, which says the
> options are fixed-width ascii, variable-length ascii, or
> variable-length utf-8 ... I guess it's just talking about what h5py
> currently supports.

It's okay, I made exactly the same mistake earlier in the thread. :-)

> But also, is it important whether strings we're loading/saving to an
> HDF5 file have the same in-memory representation in numpy as they
> would in the file? I *know* [1] no-one is reading HDF5 files using
> np.memmap :-). Is it important for some other reason?

The lack of such a dtype seems to be the reason why neither h5py nor
PyTables supports that kind of HDF5 Dataset. The variable-length Datasets
can take up a lot of disk space because they can't be compressed (more than
the fixed-width ones, even accounting for the padding those waste). I mean,
they probably could have implemented it with object arrays, like h5py does
with the variable-length string Datasets, but they didn't.
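
To make the trade-off concrete, here is a rough sketch of the two in-memory
options using only plain numpy (no HDF5 library involved). The 8-byte field
width and the sample values are made up for illustration; `raw` just stands in
for a chunk of fixed-width, nul-padded data as it might sit in a file.

```python
import numpy as np

# Hypothetical file contents: three 8-byte, nul-padded ASCII fields.
raw = b"alpha\x00\x00\x00" + b"be\x00\x00\x00\x00\x00\x00" + b"gamma123"

# A fixed-width bytes dtype can view that buffer directly, with no copy
# and no per-element Python objects.
fixed = np.frombuffer(raw, dtype="S8")
print(fixed)            # elements come back with the padding stripped
print(fixed.itemsize)   # width of each field in bytes
print(fixed.nbytes)     # total size of the buffer being viewed

# Bytes spent on padding rather than on payload.
wasted = fixed.nbytes - sum(len(s) for s in fixed.tolist())
print(wasted)

# The object-array alternative: the array only holds pointers, and each
# string is a separate Python object, so the layout no longer matches
# the fixed-width layout in the file.
obj = np.array(fixed.tolist(), dtype=object)
print(obj.dtype, obj.itemsize)
```

The fixed-width view is what makes a matching dtype convenient for a file
format; the object array works, but every element has to be converted on the
way in and out.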

https://github.com/PyTables/PyTables/issues/499
https://github.com/h5py/h5py/issues/624
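
On Nathaniel's point about trailing nuls quoted above, a quick sketch of what
happens today, assuming the current 'S' and 'U' dtypes (the sample values are
arbitrary):

```python
import numpy as np

original_bytes = b"abc\x00"
arr_s = np.array([original_bytes], dtype="S4")

# The nul is still in the array's buffer...
raw_buffer = arr_s.tobytes()
# ...but it is treated as padding and stripped when the element is read.
round_tripped_bytes = bytes(arr_s[0])

original_text = "abc\x00"
arr_u = np.array([original_text], dtype="U4")
round_tripped_text = str(arr_u[0])

print(raw_buffer, round_tripped_bytes, repr(round_tripped_text))
print(round_tripped_bytes == original_bytes)
print(round_tripped_text == original_text)
```

So the value does not round-trip: numpy cannot tell a genuine trailing nul
from padding in a fixed-width field.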

--
Robert Kern