On Apr 26, 2017 12:09 PM, "Robert Kern" <robert.kern@gmail.com> wrote:
On Wed, Apr 26, 2017 at 10:43 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
> I have read every mail and it has been a large waste of time; everything
> has been said already many times in the last few years.
> Even if you memory map string arrays, of which I have not seen a
> concrete use case in the mails beyond "would be nice to have" without
> any backing in actual code, but I may have missed it.

Yes, we have stated that FITS files with string arrays are currently being read via memory mapping.

You were even pointed to a minor HDF5 implementation that memory maps.

While I'm afraid that I can't share the actual code for the full variety of proprietary file formats that I've written readers for, I can assure you that I have memory mapped many string arrays in my time, usually embedded as columns in structured arrays. It is not "nice to have"; it is "have done many times and needs better support".

Since concrete examples are often helpful in focusing discussions, here's some code for reading a lab-internal EEG file format:


See in particular _header_dtype with its embedded string fields, and the code in _channel_names_from_header -- both of these really benefit from having a quick and easy way to talk about fixed-width strings of single-byte characters. (The history here of course is that the original tools for reading/writing this format are written in C, and they just read in sizeof(struct header) and cast to the header.)
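Since I can't attach the file itself, here is a minimal sketch of the pattern: a C header struct mirrored as a numpy structured dtype with embedded fixed-width byte-string fields, memory mapped straight off disk. All field names and sizes here are invented for illustration, not the actual format:

```python
import numpy as np

# Hypothetical header layout with embedded fixed-width string fields.
header_dtype = np.dtype([
    ("magic",         "S4"),       # file signature
    ("version",       "<u2"),      # little-endian uint16
    ("n_channels",    "<u2"),
    ("subject_id",    "S16"),      # NUL-padded ASCII identifier
    ("channel_names", "S8", (4,)), # one fixed-width name per channel
])

def read_header(path):
    # Memory-map just the header region and view it through the struct
    # layout -- the moral equivalent of the C idiom "read
    # sizeof(struct header) and cast to the header".
    return np.memmap(path, dtype=header_dtype, mode="r", shape=(1,))[0]

def channel_names_from_header(header):
    # Indexing an S field strips trailing NULs, which is exactly what
    # you want for NUL-padded names; empty slots fall out naturally.
    return [name for name in header["channel_names"] if name]
```

The point is that the S dtype lets the struct layout and the string handling live in one dtype declaration, with no copies on read.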

_get_full_string in that file is also interesting: it's a nasty hack I implemented because in some cases I actually needed *fixed width* strings, not NUL padded ones, and didn't know a better way to do it. (Yes, there's void, but I have no idea how those work. They're somehow related to buffer objects, whatever those are?) In other cases though that file really does want NUL padding.

Of course that file is python 2 and blissfully ignorant of unicode. Thinking about what we'd want if porting to py3:

For the "pull out this fixed width chunk of the file" problem (what _get_full_string does) then I definitely don't care about unicode; this isn't text. np.void or an array of np.uint8 aren't actually too terrible I suspect, but it'd be nice if there were a fixed-width dtype where indexing gave back a native bytes or bytearray object, or something similar like np.bytes_.
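For the record, one way to get the full fixed-width bytes back out without a dtype like that -- a guess at the kind of thing _get_full_string has to do, not the actual code -- is to take a length-1 slice, which keeps the dtype's full itemsize, and grab its raw buffer:

```python
import numpy as np

def get_full_string(arr, index):
    """Return the full fixed-width bytes of element `index` of a 1-d
    S-dtype array, including the trailing NUL padding that normal
    indexing strips.  Illustrative sketch, not the original helper."""
    # arr[index] would give b'abc' for an S8 element; the slice keeps
    # all 8 bytes, and tobytes() hands back the buffer untouched.
    return arr[index:index + 1].tobytes()

a = np.array([b"abc", b"defgh"], dtype="S8")
a[0]                   # padding stripped
get_full_string(a, 0)  # all 8 bytes, padding intact
```

It works, but it is exactly the kind of nasty hack that a bytes-returning fixed-width dtype would make unnecessary.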

For the arrays of single-byte-encoded-NUL-padded text, the fundamental problem is just to convert between a chunk of bytes in that format and something that numpy can handle. One way to do that would be with a dtype that represented ascii-encoded-fixed-width-NUL-padded text, or any ascii-compatible encoding. But honestly I'd be just as happy with np.encode/np.decode ufuncs that converted between the existing S dtype and any kind of text array; the existing U dtype would be fine given that.
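(Something close to this already exists, for what it's worth: np.char.decode and np.char.encode do elementwise conversion between S and U arrays. They're Python-loop wrappers rather than true ufuncs, so they're slow, but they show the shape of the conversion I mean:

```python
import numpy as np

# Elementwise decode of NUL-padded single-byte text to a U array,
# and back again, via the existing np.char wrappers.
s = np.array([b"spam", b"eggs"], dtype="S4")
u = np.char.decode(s, "ascii")    # -> unicode (U) array
s2 = np.char.encode(u, "ascii")   # -> back to an S array
```

Proper ufunc versions of these would cover my use case.)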

The other thing that might be annoying in practice is that when writing py2/py3 polyglot code, I can say "str" to mean "bytes on py2 and unicode on py3", but there's no dtype with similar behavior. Maybe there's no good solution and this just needs a few version-dependent convenience functions stuck in a private utility library, dunno.
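One possible shape for such a convenience function -- purely illustrative, nothing like this exists today -- would be a helper that tracks what `str` means on the running interpreter:

```python
import sys
import numpy as np

# bytes ("S") on py2, unicode ("U") on py3 -- mirroring what the
# `str` builtin means on each.  Hypothetical utility, not numpy API.
NATIVE_STR_KIND = "S" if sys.version_info[0] == 2 else "U"

def native_str_dtype(length):
    """Fixed-width dtype holding `length` native-str characters."""
    return np.dtype("%s%d" % (NATIVE_STR_KIND, length))
```

That papers over the polyglot problem for dtype construction, though not for data already on disk.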

> What you save by having utf8 in the numpy array is replacing a decoding
> and encoding step with a stripping null padding step.
> That doesn't seem very worthwhile compared to all the other overheads
> involved.

It's worthwhile enough that neither of the major HDF5 bindings supports Unicode arrays, despite user requests for years. The sticking point seems to be the difference between HDF5's view of a Unicode string array (defined in size by the bytes of UTF-8 data) and numpy's current view of a Unicode string array (because of UCS-4, defined by the number of characters/codepoints/whatever). So there are HDF5 files out there that none of our HDF5 bindings can read, and it is impossible to write certain data efficiently.

I would really like to hear more from the authors of these libraries about what exactly it is they feel they're missing. Is it that they want numpy to enforce the length limit early, to catch errors when the array is modified instead of when they go to write it to the file? Is it that they really want an O(1) way to look at an array and know the maximum number of bytes needed to represent it in UTF-8? Is it that UTF-8 <-> UTF-32 conversion is really annoying and files that need it are rare, so they haven't had the motivation to implement it? My impression is similar to Julian's: you *could* implement HDF5 fixed-length UTF-8 <-> numpy U arrays with a few dozen lines of code, which is nothing compared to all the other hoops these libraries are already jumping through, so if this is really the roadblock then I must be missing something.
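To make "a few dozen lines" concrete, here is a rough sketch of what I imagine that conversion looks like -- my guess at the shape of the problem, not code from any of these libraries. The fiddly part is encoding back: an over-long element has to be truncated at a codepoint boundary, not a byte boundary:

```python
import numpy as np

def utf8_fixed_to_unicode(raw, itemsize):
    """Decode a flat bytes buffer of fixed-size, NUL-padded UTF-8
    fields (the HDF5 fixed-length layout) into a numpy U array."""
    n = len(raw) // itemsize
    out = [raw[i * itemsize:(i + 1) * itemsize].rstrip(b"\x00").decode("utf-8")
           for i in range(n)]
    width = max((len(s) for s in out), default=1)
    return np.array(out, dtype="U%d" % width)

def unicode_to_utf8_fixed(arr, itemsize):
    """Encode a U array into fixed-size NUL-padded UTF-8 fields,
    truncating at a codepoint boundary when an element over-fills
    its slot."""
    chunks = []
    for s in arr:
        b = str(s).encode("utf-8")
        while len(b) > itemsize:   # drop codepoints until it fits
            s = s[:-1]
            b = str(s).encode("utf-8")
        chunks.append(b.ljust(itemsize, b"\x00"))
    return b"".join(chunks)
```

Slow, sure, but it shows there is no conceptual obstacle -- which is why I suspect the real objection lies elsewhere.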