
truncating null bytes in 'S' breaks decoding that needs them
a = np.array([si.encode('utf-16LE') for si in ['Õsc', 'zxc']], dtype='S')
a
array([b'\xd5\x00s\x00c', b'z\x00x\x00c'], dtype='|S6')
[ai.decode('utf-16LE') for ai in a]
Traceback (most recent call last):
  File "<pyshell#118>", line 1, in <module>
    [ai.decode('utf-16LE') for ai in a]
  File "<pyshell#118>", line 1, in <listcomp>
    [ai.decode('utf-16LE') for ai in a]
  File "C:\Programs\Python33\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x63 in position 4: truncated data
messy workaround (arrays in contrast to scalars are not truncated in `tostring`)
[a[i:i+1].tostring().decode('utf-16LE') for i in range(len(a))]
['Õsc', 'zxc']
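
A minimal sketch of what is going on, for anyone who wants to reproduce it. The names `words`, `via_slices` and `via_padding` are only for illustration. It uses `tobytes()` because `tostring()` was deprecated in later NumPy releases, and the padding variant assumes the text does not itself end in U+0000:

```python
import numpy as np

words = ['Õsc', 'zxc']
a = np.array([w.encode('utf-16-le') for w in words], dtype='S')

# Every element occupies the full item size in the array buffer ...
itemsize = a.dtype.itemsize

# ... but indexing or iterating yields bytes scalars with trailing NULs
# stripped, so the high byte of the final 'c' (b'c\x00') is lost.
scalar_lengths = [len(x) for x in a]

# Workaround from above: one-element slices keep the raw buffer intact.
via_slices = [a[i:i + 1].tobytes().decode('utf-16-le') for i in range(len(a))]

# Alternative (assumes no trailing U+0000 in the text): re-pad each
# scalar to an even number of bytes before decoding.
via_padding = [(bytes(x) + b'\x00' * (len(x) % 2)).decode('utf-16-le')
               for x in a]

print(itemsize, scalar_lengths)
print(via_slices, via_padding)
```

With strings of different lengths the slice version would also return the NUL padding as trailing '\x00' characters, which would then have to be stripped.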
Found while playing with examples in the other thread.
Josef

Josef,
Nice find -- another reason why 'S' can NOT be used as-is for arbitrary bytes.
See the other thread for my proposals about that.
> messy workaround (arrays in contrast to scalars are not truncated in `tostring`)
> [a[i:i+1].tostring().decode('utf-16LE') for i in range(len(a))]
> ['Õsc', 'zxc']
I think the real "workaround" is not to store arbitrary bytes, i.e. encoded text, in the 'S' dtype at all.
But is there a convenient way to do it with other existing numpy types?
I tried to do it with uint8, and it's really awkward....
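
For reference, a rough sketch of one way the uint8 route could look. This is not necessarily what I tried; the layout (one zero-padded row per string) and the names `buf` and `roundtrip` are assumptions made for illustration:

```python
import numpy as np

words = ['Õsc', 'zxc']
encoded = [w.encode('utf-16-le') for w in words]
width = max(len(e) for e in encoded)

# One row per string, zero-padded on the right to a common width.
buf = np.zeros((len(encoded), width), dtype=np.uint8)
for i, e in enumerate(encoded):
    buf[i, :len(e)] = np.frombuffer(e, dtype=np.uint8)

# Rows keep every byte, including trailing NULs, so decoding works;
# the padding shows up as '\x00' characters and has to be stripped.
roundtrip = [row.tobytes().decode('utf-16-le').rstrip('\x00') for row in buf]

print(buf.shape, roundtrip)
```

It round-trips, but the padding, the copy loop and the per-row decode all have to be done by hand, which is the awkward part.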
-CHB
participants (2)
- Chris Barker
- josef.pktd@gmail.com