[Numpy-discussion] proposal: smaller representation of string arrays

Nathaniel Smith njs at pobox.com
Wed Apr 26 19:49:29 EDT 2017


On Apr 26, 2017 12:09 PM, "Robert Kern" <robert.kern at gmail.com> wrote:

On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:
[...]
> I have read every mail and it has been a large waste of time. Everything
> has been said already many times in the last few years.
> Even for memory mapping string arrays, I have not seen a concrete use
> case in the mails beyond "would be nice to have", without any backing in
> actual code -- though I may have missed it.

Yes, we have stated that FITS files with string arrays are currently being
read via memory mapping.

  http://docs.astropy.org/en/stable/io/fits/index.html

You were even pointed to a minor HDF5 implementation that memory maps:

  https://github.com/jjhelmus/pyfive/blob/master/pyfive/low_level.py#L682-L683

I'm afraid that I can't share the actual code for the full variety of
proprietary file formats that I've written code for, but I can assure you
that I have memory mapped many string arrays in my time, usually embedded
as columns in structured arrays. It is not "nice to have"; it is "have
done many times and needs better support".


Since concrete examples are often helpful in focusing discussions, here's
some code for reading a lab-internal EEG file format:

https://github.com/rerpy/rerpy/blob/master/rerpy/io/erpss.py

See in particular _header_dtype with its embedded string fields, and the
code in _channel_names_from_header -- both of these really benefit from
having a quick and easy way to talk about fixed-width strings of
single-byte characters. (The history here, of course, is that the original
tools for reading/writing this format are written in C, and they just read
in sizeof(struct header) bytes and cast the result to the header struct.)
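
For a sense of what that looks like in numpy terms, here's a minimal
sketch of the pattern (the field names and sizes below are made up for
illustration; they are not the actual _header_dtype layout):

    import numpy as np

    # Hypothetical header layout -- illustrative only, not erpss.py's real one.
    header_dtype = np.dtype([
        ('magic',  'S4'),    # fixed-width single-byte string field
        ('name',   'S32'),   # NUL-padded name
        ('nchans', '<u2'),
        ('srate',  '<f4'),
    ])

    # Build 42 bytes of fake header data, then "cast" it to the struct,
    # much like the C tools read sizeof(struct header) bytes and cast.
    raw = (b'EEG1'
           + b'FZ'.ljust(32, b'\x00')
           + np.array(32, '<u2').tobytes()
           + np.array(250.0, '<f4').tobytes())
    header = np.frombuffer(raw, dtype=header_dtype)[0]
    header['name']   # -> b'FZ': the S dtype strips the trailing NULs

The same dtype works directly with np.fromfile or np.memmap against the
real file.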

_get_full_string in that file is also interesting: it's a nasty hack I
implemented because in some cases I actually needed *fixed-width* strings,
not NUL-padded ones, and didn't know a better way to do it. (Yes, there's
np.void, but I have no idea how those work. They're somehow related to
buffer objects, whatever those are?) In other cases, though, that file
really does want NUL padding.

Of course that file is python 2 and blissfully ignorant of unicode.
Thinking about what we'd want if porting to py3:

For the "pull out this fixed width chunk of the file" problem (what
_get_full_string does) then I definitely don't care about unicode; this
isn't text. np.void or an array of np.uint8 aren't actually too terrible I
suspect, but it'd be nice if there were a fixed-width dtype where indexing
gave back a native bytes or bytearray object, or something similar like
np.bytes_.
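
To make that concrete, here's one possible workaround in the spirit of
_get_full_string (an illustration of the idea, not the actual code from
erpss.py): view the S array as void, so indexing stops stripping the
trailing NULs:

    import numpy as np

    a = np.array([b'ab\x00\x00', b'cdef'], dtype='S4')

    a[0]                 # -> b'ab': the S dtype silently drops the padding

    # Reinterpret the same memory as opaque 4-byte void items; nothing is
    # stripped, so we get the full fixed-width chunk back.
    full = a.view(np.dtype(('V', a.dtype.itemsize)))
    full[0].tobytes()    # -> b'ab\x00\x00', all four bytes preserved

(A uint8 view plus a reshape works too, but it would obviously be nicer
to have a dtype that just hands back bytes objects directly.)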

For arrays of single-byte-encoded, NUL-padded text, the fundamental
problem is just to convert between a chunk of bytes in that format and
something that numpy can handle. One way to do that would be with a dtype
that represented ascii-encoded, fixed-width, NUL-padded text (or any
ascii-compatible encoding). But honestly I'd be just as happy with
np.encode/np.decode ufuncs that converted between the existing S dtype and
any kind of text array; the existing U dtype would be fine given that.
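
In fact, the existing np.char.decode/np.char.encode helpers already do
the element-wise conversion today (just not as real ufuncs), so the
round-trip I have in mind is roughly:

    import numpy as np

    raw = np.array([b'FZ\x00\x00', b'CZ\x00\x00', b'PZ\x00\x00'], dtype='S4')

    # S -> U: decode each element; the S dtype already dropped the
    # trailing NULs, so this gives plain unicode text.
    text = np.char.decode(raw, 'ascii')

    # U -> S: encode, then pad back out to the original fixed width.
    back = np.char.encode(text, 'ascii').astype('S4')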

The other thing that might be annoying in practice is that when writing
py2/py3 polyglot code, I can say "str" to mean "bytes on py2 and unicode on
py3", but there's no dtype with similar behavior. Maybe there's no good
solution and this just needs a few version-dependent convenience functions
stuck in a private utility library, dunno.
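
E.g. something as dumb as this hypothetical helper (not proposing it for
numpy itself, just the kind of thing I mean):

    import sys
    import numpy as np

    def native_str_dtype(length):
        # 'S' on py2, 'U' on py3 -- mirrors what plain `str` does.
        kind = 'S' if sys.version_info[0] == 2 else 'U'
        return np.dtype('%s%d' % (kind, length))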


> What you save by having utf8 in the numpy array is replacing a decoding
> and encoding step with a null-padding-stripping step.
> That doesn't seem very worthwhile compared to all the other overheads
> involved.

It's worthwhile enough that neither of the major HDF5 bindings supports
Unicode arrays, despite years of user requests. The sticking point seems
to be the difference between HDF5's view of a Unicode string array
(defined in size by the bytes of UTF-8 data) and numpy's current view of a
Unicode string array (because of UCS-4, defined by the number of
characters/codepoints/whatever). So there are HDF5 files out there that
none of our HDF5 bindings can read, and it is impossible to write certain
data efficiently.


I would really like to hear more from the authors of these libraries about
what exactly it is they feel they're missing. Is it that they want numpy to
enforce the length limit early, to catch errors when the array is modified
instead of when they go to write it to the file? Is it that they really
want an O(1) way to look at an array and know the maximum number of bytes
needed to represent it in utf-8? Is it that utf-8 <-> utf-32 conversion is
really annoying, and files that need it are rare, so they haven't had the
motivation to implement it? My impression is similar to Julian's: you
*could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
dozen lines of code, which is nothing compared to all the other hoops these
libraries are already jumping through, so if this is really the roadblock
then I must be missing something.
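
Just to give a sense of the order of magnitude I'm imagining, here's a
rough, untested sketch of that fixed-length utf-8 <-> U round-trip
(function names made up, not anything from h5py or PyTables, and only
handling the 1-d case):

    import numpy as np

    def utf8_fixed_to_unicode(buf, itemsize):
        # buf holds n fields of `itemsize` bytes of NUL-padded utf-8 each.
        raw = np.frombuffer(buf, dtype='S%d' % itemsize)
        # Worst case is one codepoint per byte, so U<itemsize> always fits.
        out = np.empty(raw.shape, dtype='U%d' % itemsize)
        for i, b in enumerate(raw):
            out[i] = b.decode('utf-8')
        return out

    def unicode_to_utf8_fixed(arr, itemsize):
        out = np.zeros(arr.shape, dtype='S%d' % itemsize)
        for i, s in enumerate(arr):
            b = s.encode('utf-8')
            if len(b) > itemsize:
                raise ValueError('%r needs %d bytes, field width is %d'
                                 % (s, len(b), itemsize))
            out[i] = b
        return out

Slow, sure, but that's the whole conversion; everything else is
bookkeeping these libraries are already doing.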

-n