<div dir="ltr">On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <<a href="mailto:njs@pobox.com">njs@pobox.com</a>> wrote:<br>><br>> On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern <<a href="mailto:robert.kern@gmail.com">robert.kern@gmail.com</a>> wrote:<br>> > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith <<a href="mailto:njs@pobox.com">njs@pobox.com</a>> wrote:<br>> ><br>> >> That said, AFAICT what people actually want in most use cases is support<br>> >> for arrays that can hold variable-length strings, and the only place where<br>> >> the current approach is *optimal* is when we need mmap compatibility with<br>> >> legacy formats that use fixed-width-nul-padded fields (at which point it's<br>> >> super convenient). It's not even possible to *represent* all Python strings<br>> >> or bytestrings in current numpy unicode or string arrays (Python<br>> >> strings/bytestrings can have trailing nuls). So if we're talking about<br>> >> tweaks to the current system it probably makes sense to focus on this use<br>> >> case specifically.<br>> >><br>> >> From context I'm assuming FITS files use fixed-width-nul-padding for<br>> >> strings? Is that right? I know HDF5 doesn't.<br>> ><br>> > Yes, HDF5 does. Or at least, it is supported in addition to the<br>> > variable-length ones.<br>> ><br>> > <a href="https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html">https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html</a><br>><br>> Doh, I found that page but it was (and is) meaningless to me, so I<br>> went by <a href="http://docs.h5py.org/en/latest/strings.html">http://docs.h5py.org/en/latest/strings.html</a>, which says the<br>> options are fixed-width ascii, variable-length ascii, or<br>> variable-length utf-8 ... I guess it's just talking about what h5py<br>> currently supports.<br><br>It's okay, I made exactly the same mistake earlier in the thread. 
:-)<br><br>> But also, is it important whether strings we're loading/saving to an<br>> HDF5 file have the same in-memory representation in numpy as they<br>> would in the file? I *know* [1] no-one is reading HDF5 files using<br>> np.memmap :-). Is it important for some other reason?<br><br>The lack of such a dtype seems to be the reason why neither h5py nor PyTables supports that kind of HDF5 Dataset. The variable-length Datasets can take up a lot of disk space because they can't be compressed (even accounting for the wasted padding space). I mean, they probably could have implemented it with object arrays like h5py does with the variable-length string Datasets, but they didn't.<br><br><a href="https://github.com/PyTables/PyTables/issues/499">https://github.com/PyTables/PyTables/issues/499</a><div><a href="https://github.com/h5py/h5py/issues/624">https://github.com/h5py/h5py/issues/624</a><div><br>--<br>Robert Kern</div></div></div>