On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith <njs@pobox.com> wrote:
That said, AFAICT what people actually want in most use cases is support for arrays that can hold variable-length strings, and the only place where the current approach is *optimal* is when we need mmap compatibility with legacy formats that use fixed-width-nul-padded fields (at which point it's super convenient). It's not even possible to *represent* all Python strings or bytestrings in current numpy unicode or string arrays (Python strings/bytestrings can have trailing nuls). So if we're talking about tweaks to the current system it probably makes sense to focus on this use case specifically.
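For concreteness, here is a minimal sketch of both halves of that point: the fixed-width dtype lines up nicely with nul-padded legacy records, but a bytestring with a trailing nul does not survive a round trip through it. The record layout (three 8-byte fields) and the variable names are made up for illustration; only standard numpy calls are used.

```python
import numpy as np

# Hypothetical legacy layout: three 8-byte, nul-padded ASCII fields.
raw = b"alpha\x00\x00\x00" + b"be\x00\x00\x00\x00\x00\x00" + b"gamma123"

# The fixed-width 'S8' dtype can view these bytes directly, without copying.
fields = np.frombuffer(raw, dtype="S8")
decoded = fields.tolist()  # padding nuls are stripped on element access

# The flip side: a bytestring that really ends in a nul cannot be
# distinguished from padding, so it does not round-trip.
original = b"abc\x00"
arr = np.array([original])   # dtype is inferred as 'S4'
round_tripped = arr[0]       # the trailing nul is dropped here

# An object array keeps the Python object as-is, at the cost of losing
# the flat, mmap-friendly memory layout.
obj = np.array([original], dtype=object)

print(decoded)
print(repr(round_tripped), repr(obj[0]))
```

The raw buffer behind `arr` still holds all four bytes; it is the element access that treats trailing nuls as padding. The same thing happens with the 'U' dtype and Python str.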
From context I'm assuming FITS files use fixed-width-nul-padding for strings? Is that right? I know HDF5 doesn't.
Yes, HDF5 does. Or at least, it is supported in addition to the variable-length ones.
https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
Doh, I found that page but it was (and is) meaningless to me, so I went by http://docs.h5py.org/en/latest/strings.html, which says the options are fixed-width ascii, variable-length ascii, or variable-length utf-8 ... I guess it's just talking about what h5py currently supports.
It's okay, I made exactly the same mistake earlier in the thread. :-)
But also, is it important whether strings we're loading/saving to an HDF5 file have the same in-memory representation in numpy as they would in the file? I *know* [1] no-one is reading HDF5 files using np.memmap :-). Is it important for some other reason?
The lack of such a dtype seems to be the reason why neither h5py nor PyTables supports that kind of HDF5 Dataset. The variable-length Datasets can take up a lot of disk space because they can't be compressed (even accounting for the padding space that fixed-width fields waste). I mean, they probably could have implemented it with object arrays like h5py does with the variable-length string Datasets, but they didn't.

https://github.com/PyTables/PyTables/issues/499
https://github.com/h5py/h5py/issues/624

--
Robert Kern