[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Wed Apr 26 15:17:15 EDT 2017


On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg <sebastian at sipsolutions.net>
wrote:

> I remember talking with a colleague about something like that. And
> basically an annoying thing there was that if you strip the zero bytes
> in a zero padded string, some encodings (UTF16) may need one of the
> zero bytes to work right. (I think she got around it, by weird
> trickery, inverting the endianess or so and thus putting the zero bytes
> first).
> Maybe will ask her if this discussion is interesting to her. Though I
> think it might have been something like "make everything in
> hdf5/something similar work" without any actual use case, I don't know.

I don't think that will be an issue for an encoding-parameterized dtype.
The decoding machinery of that would have access to the full-width buffer
for the item, and the encoding knows what it's atomic unit is (e.g. 2 bytes
for UTF-16). It's only if you have to hack around at a higher level with
numpy's S arrays, which return Python byte strings that strip off the
trailing NULL bytes, that you have to worry about such things. Getting a
Python scalar from the numpy S array loses information in such cases.

--
Robert Kern
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20170426/fe39bdb3/attachment-0001.html>


More information about the NumPy-Discussion mailing list