[Numpy-discussion] proposal: smaller representation of string arrays

Nathaniel Smith njs at pobox.com
Mon Apr 24 22:41:31 EDT 2017


On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern <robert.kern at gmail.com> wrote:
> On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith <njs at pobox.com> wrote:
>
>> That said, AFAICT what people actually want in most use cases is support
>> for arrays that can hold variable-length strings, and the only place where
>> the current approach is *optimal* is when we need mmap compatibility with
>> legacy formats that use fixed-width-nul-padded fields (at which point it's
>> super convenient). It's not even possible to *represent* all Python strings
>> or bytestrings in current numpy unicode or string arrays (Python
>> strings/bytestrings can have trailing nuls). So if we're talking about
>> tweaks to the current system it probably makes sense to focus on this use
>> case specifically.
>>
>> From context I'm assuming FITS files use fixed-width-nul-padding for
>> strings? Is that right? I know HDF5 doesn't.
>
> Yes, HDF5 does. Or at least, it is supported in addition to the
> variable-length ones.
>
> https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html

Doh, I found that page but it was (and is) meaningless to me, so I
went by http://docs.h5py.org/en/latest/strings.html, which says the
options are fixed-width ASCII, variable-length ASCII, or
variable-length UTF-8. I guess that page is just describing what h5py
currently supports, not everything HDF5 itself allows.

But also, is it important whether strings we're loading/saving to an
HDF5 file have the same in-memory representation in numpy as they
would in the file? I *know* [1] no-one is reading HDF5 files using
np.memmap :-). Is it important for some other reason?

Also, further searching suggests that HDF5 actually supports all of
nul termination, nul padding, and space padding, and that nul
termination is the default? How much does it help to have in-memory
compatibility with just one of these options (and not even the default
one)? Would we need to add the other options to be really useful for
HDF5? (Unlikely to happen within numpy itself, but potentially
something that could be done inside h5py or whatever if numpy's
user-defined dtype system were a little more useful.)
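
To make the three layouts concrete, here is a rough sketch of what each
one looks like as raw bytes and what numpy's existing fixed-width "S"
dtype does with it. The 6-byte field width, the sample byte strings, and
the decode_field helper are all made up for illustration; none of this
comes from the HDF5 or h5py documentation, and it is not a proposal for
how numpy or h5py should implement anything.

```python
import numpy as np

# The same logical string b"abc" in a 6-byte fixed-width field, stored
# under the three padding conventions discussed above (illustrative data).
nul_padded = b"abc\x00\x00\x00"
nul_terminated = b"abc\x00zz"   # bytes after the terminator are arbitrary junk
space_padded = b"abc   "


def as_numpy_s6(raw):
    """View 6 raw bytes as one element of numpy's 'S6' dtype (no copy)."""
    return np.frombuffer(raw, dtype="S6")[0]


def decode_field(raw, convention):
    """Hypothetical helper: recover the logical string under a convention."""
    if convention == "nul_padded":
        return raw.rstrip(b"\x00")
    if convention == "nul_terminated":
        return raw.split(b"\x00", 1)[0]
    if convention == "space_padded":
        return raw.rstrip(b" ")
    raise ValueError(f"unknown convention: {convention}")


# numpy's 'S' dtype only strips trailing nuls, so it matches nul padding
# but not the other two conventions.
numpy_view = {
    "nul_padded": as_numpy_s6(nul_padded),
    "nul_terminated": as_numpy_s6(nul_terminated),
    "space_padded": as_numpy_s6(space_padded),
}

decoded = {
    "nul_padded": decode_field(nul_padded, "nul_padded"),
    "nul_terminated": decode_field(nul_terminated, "nul_terminated"),
    "space_padded": decode_field(space_padded, "space_padded"),
}

# The flip side of stripping trailing nuls: a bytestring that really does
# end in a nul cannot round-trip through an 'S' array.
round_tripped = np.array([b"ab\x00"])[0]

for name in numpy_view:
    print(name, "numpy sees:", numpy_view[name], "logical value:", decoded[name])
print("b'ab\\x00' round-trips as:", round_tripped)
```

Only the nul-padded field comes back from numpy as the logical value;
the nul-terminated and space-padded fields keep their junk or padding
bytes and would need an extra decoding step somewhere.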

-n

[1] hope

-- 
Nathaniel J. Smith -- https://vorpus.org

