truncating null bytes in 'S' breaks decoding that needs them
>>> a = np.array([si.encode('utf-16LE') for si in ['Õsc', 'zxc']], dtype='S')
>>> a
array([b'\xd5\x00s\x00c', b'z\x00x\x00c'], dtype='|S6')
>>> [ai.decode('utf-16LE') for ai in a]
Traceback (most recent call last):
  File "<pyshell#118>", line 1, in <module>
    [ai.decode('utf-16LE') for ai in a]
  File "<pyshell#118>", line 1, in <listcomp>
    [ai.decode('utf-16LE') for ai in a]
  File "C:\Programs\Python33\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x63 in position 4: truncated data
messy workaround (arrays in contrast to scalars are not truncated in `tostring`)
>>> [a[i:i+1].tostring().decode('utf-16LE') for i in range(len(a))]
['Õsc', 'zxc']

Found while playing with examples in the other thread.

Josef
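For readers who want to reproduce the report, here is a minimal sketch of the truncation and the slice-based workaround. It assumes a NumPy version that provides `tobytes()`, the newer name for the `tostring()` call used above; the variable names are only illustrative.

```python
import numpy as np

# Encode each string as UTF-16LE. ASCII characters end in a null byte,
# so both encoded values are 6 bytes long and end with b'\x00'.
words = ['Õsc', 'zxc']
a = np.array([w.encode('utf-16LE') for w in words], dtype='S')

# The array buffer keeps all 6 bytes per element ...
assert a.dtype.itemsize == 6
# ... but scalar access strips the trailing null, leaving 5 bytes,
# which is why decoding a[0] fails with "truncated data".
assert len(a[0]) == 5

# A one-element slice is still an array, so tobytes() returns the full
# 6-byte buffer and the decode succeeds.
decoded = [a[i:i+1].tobytes().decode('utf-16LE') for i in range(len(a))]
# decoded == ['Õsc', 'zxc']
```

The point of the slice is only to avoid the scalar conversion. The stored data is intact; the null byte is dropped when an element is turned into a bytes scalar.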
Josef,

Nice find -- another reason why 'S' can NOT be used as-is for arbitrary bytes. See the other thread for my proposals about that.
> messy workaround (arrays in contrast to scalars are not truncated in `tostring`)
> >>> [a[i:i+1].tostring().decode('utf-16LE') for i in range(len(a))]
> ['Õsc', 'zxc']
I think the real "work around" is to not try to store arbitrary bytes -- i.e. encoded text, in the 'S' dtype. But is there a convenient way to do it with other existing numpy types? I tried to do it with uint8, and it's really awkward....

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
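To make the uint8 question concrete, here is one possible sketch of storing encoded text in a 2-D uint8 array, one row per string. The helpers `encode_rows` and `decode_rows` are hypothetical names made up for this illustration, not NumPy functions, and the null padding assumes that no string legitimately ends in a NUL character and that the input list is non-empty.

```python
import numpy as np

def encode_rows(strings, encoding='utf-16LE'):
    """Hypothetical helper: pack encoded strings into a 2-D uint8 array."""
    encoded = [s.encode(encoding) for s in strings]
    width = max(len(b) for b in encoded)
    # Pad on the right with null bytes so every row has the same width.
    padded = b''.join(b.ljust(width, b'\x00') for b in encoded)
    return np.frombuffer(padded, dtype=np.uint8).reshape(len(encoded), width)

def decode_rows(rows, encoding='utf-16LE'):
    """Hypothetical helper: decode each row and drop the NUL padding."""
    return [row.tobytes().decode(encoding).rstrip('\x00') for row in rows]

rows = encode_rows(['Õsc', 'zxc', 'ab'])
assert rows.shape == (3, 6)   # three strings, 6 bytes per row

decoded = decode_rows(rows)
# decoded == ['Õsc', 'zxc', 'ab']
```

Nothing is truncated here because uint8 rows are plain numbers, but the bookkeeping (row width, padding, stripping) is left to the caller, which is probably the awkwardness described above.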
participants (2): Chris Barker, josef.pktd@gmail.com