On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <njs@pobox.com> wrote:
> On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern <robert.kern@gmail.com> wrote:
> > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith <njs@pobox.com> wrote:
> >
> >> That said, AFAICT what people actually want in most use cases is support
> >> for arrays that can hold variable-length strings, and the only place where
> >> the current approach is *optimal* is when we need mmap compatibility with
> >> legacy formats that use fixed-width-nul-padded fields (at which point it's
> >> super convenient). It's not even possible to *represent* all Python strings
> >> or bytestrings in current numpy unicode or string arrays (Python
> >> strings/bytestrings can have trailing nuls). So if we're talking about
> >> tweaks to the current system it probably makes sense to focus on this use
> >> case specifically.
> >>
> >> From context I'm assuming FITS files use fixed-width-nul-padding for
> >> strings? Is that right? I know HDF5 doesn't.
> >
> > Yes, HDF5 does. Or at least, it is supported in addition to the
> > variable-length ones.
> >
> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>
> Doh, I found that page but it was (and is) meaningless to me, so I
> went by http://docs.h5py.org/en/latest/strings.html, which says the
> options are fixed-width ascii, variable-length ascii, or
> variable-length utf-8 ... I guess it's just talking about what h5py
> currently supports.

It's okay, I made exactly the same mistake earlier in the thread. :-)

> But also, is it important whether strings we're loading/saving to an
> HDF5 file have the same in-memory representation in numpy as they
> would in the file? I *know* [1] no-one is reading HDF5 files using
> np.memmap :-). Is it important for some other reason?

The lack of such a dtype seems to be the reason why neither h5py nor PyTables supports that kind of HDF5 Dataset. The variable-length Datasets can take up a lot of disk space because they can't be compressed (and that holds even after accounting for the space the fixed-width ones waste on padding). I mean, they probably could have implemented it with object arrays like h5py does with the variable-length string Datasets, but they didn't.
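
As a rough illustration of the trade-off discussed above, here is a minimal sketch in plain NumPy (no HDF5 involved) contrasting a fixed-width, nul-padded bytes dtype with an object array. The variable names and the width of 8 are arbitrary choices for the example, not anything h5py or PyTables actually does:

```python
import numpy as np

values = [b"abc", b"ab\x00"]  # the second value ends in a real nul byte

# Fixed-width bytes dtype: every element occupies exactly 8 bytes in the
# buffer, padded on the right with nuls.
fixed = np.array(values, dtype="S8")
raw_buffer = fixed.tobytes()

# On read, a trailing nul cannot be told apart from padding, so it is
# stripped and the original value does not round-trip.
fixed_roundtrip = bytes(fixed[1])

# Object array: the buffer holds references to Python objects, so the value
# round-trips, but the string data is no longer laid out contiguously.
variable = np.array(values, dtype=object)
object_roundtrip = variable[1]

print(fixed.itemsize, fixed_roundtrip == values[1], object_roundtrip == values[1])
```

The fixed-width layout is what makes a direct mapping onto a fixed-width on-disk format possible; the object array keeps every value intact at the cost of that layout.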


Robert Kern