Re: [Numpy-discussion] proposal: smaller representation of string arrays

26 Apr 2017


On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg wrote:
...
> I remember talking with a colleague about something like that. And
> basically an annoying thing there was that if you strip the zero bytes
> in a zero padded string, some encodings (UTF16) may need one of the
> zero bytes to work right. (I think she got around it, by weird
> trickery, inverting the endianness or so and thus putting the zero bytes
> first).
>
> Maybe will ask her if this discussion is interesting to her. Though I
> think it might have been something like "make everything in
> hdf5/something similar work" without any actual use case, I don't know.

I don't think that will be an issue for an encoding-parameterized dtype.
The decoding machinery of that would have access to the full-width buffer
for the item, and the encoding knows what its atomic unit is (e.g. 2 bytes
for UTF-16). It's only if you have to hack around at a higher level with
numpy's S arrays, which return Python byte strings that strip off the
trailing NUL bytes, that you have to worry about such things. Getting a
Python scalar from the numpy S array loses information in such cases.
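
To make that concrete, here is a rough sketch of the difference, assuming
UTF-16-LE data stored in an S8 array. The helper decode_padded is
hypothetical (it is not a numpy function); it only illustrates what a
decoder with access to the full-width buffer could do by stripping padding
in whole 2-byte units.

```python
import numpy as np

UNIT = 2  # bytes per UTF-16 code unit


def decode_padded(buf, encoding="utf-16-le", unit=UNIT):
    """Hypothetical helper: strip zero padding in whole code units, then decode."""
    end = len(buf)
    while end >= unit and buf[end - unit:end] == b"\x00" * unit:
        end -= unit
    return buf[:end].decode(encoding)


raw = "abc".encode("utf-16-le")    # b'a\x00b\x00c\x00', 6 bytes
arr = np.array([raw], dtype="S8")  # each item is zero-padded to 8 bytes

# Getting a scalar strips *all* trailing NUL bytes, including the high
# byte of the final code unit, so the result has an odd length.
scalar = arr[0]
scalar_len = len(scalar)

try:
    scalar.decode("utf-16-le")
    scalar_decodes = True
except UnicodeDecodeError:
    scalar_decodes = False

# The full-width buffer of the item still holds every byte.
full = arr[:1].tobytes()
recovered = decode_padded(full)
```

Here the scalar comes back 5 bytes long and fails to decode, while the
full 8-byte buffer decodes to "abc" once the padding is removed one code
unit at a time.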

--
Robert Kern