
On Wed, Apr 26, 2017 at 11:38 AM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
I remember talking with a colleague about something like that. Basically, an annoying thing there was that if you strip the zero bytes in a zero-padded string, some encodings (UTF-16) may need one of the zero bytes to work right. (I think she got around it by weird trickery, inverting the endianness or so and thus putting the zero bytes first.) Maybe I will ask her if this discussion is interesting to her. Though I think it might have been something like "make everything in hdf5/something similar work" without any actual use case, I don't know.
I don't think that will be an issue for an encoding-parameterized dtype. Its decoding machinery would have access to the full-width buffer for the item, and the encoding knows what its atomic unit is (e.g. 2 bytes for UTF-16). It's only if you have to hack around at a higher level with numpy's S arrays, which return Python byte strings that strip off the trailing NULL bytes, that you have to worry about such things. Getting a Python scalar from the numpy S array loses information in such cases. -- Robert Kern
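
To make the information loss concrete, here is a minimal sketch of the S-array behavior described above. It is only an illustration, not the proposed dtype: the choice of UTF-16-LE, the 4-byte item width, and the helper name decode_item are all assumptions made for the example.

```python
import numpy as np

ENCODING = "utf-16-le"  # assumed encoding for this illustration

# "A" encodes to b"A\x00" in UTF-16-LE; the S4 dtype pads it to 4 bytes.
arr = np.array(["A".encode(ENCODING)], dtype="S4")

# Indexing returns a bytes scalar with the trailing NUL bytes stripped,
# including the NUL that is the high byte of the code unit for "A".
scalar = arr[0]

try:
    scalar.decode(ENCODING)
    scalar_decodes = True
except UnicodeDecodeError:
    # An odd number of bytes is not valid UTF-16.
    scalar_decodes = False


def decode_item(a, index, encoding=ENCODING):
    """Decode one item of a 1-D S array from its full-width buffer (hypothetical helper)."""
    width = a.dtype.itemsize
    raw = a.tobytes()[index * width:(index + 1) * width]
    # Strip padding only after decoding, as whole NUL code points.
    return raw.decode(encoding).rstrip("\x00")


decoded = decode_item(arr, 0)
```

Decoding from the full-width buffer and stripping padding afterwards keeps the 2-byte code units intact, which is the point about the encoding knowing its atomic unit. One caveat of this sketch: it cannot tell padding apart from a string that genuinely ends in U+0000.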