[Numpy-discussion] using loadtxt to load a text file in to a numpy array

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Jan 23 15:18:18 EST 2014


On Thu, Jan 23, 2014 at 1:36 PM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:
> On 23 January 2014 17:42,  <josef.pktd at gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 12:13 PM,  <josef.pktd at gmail.com> wrote:
>>> On Thu, Jan 23, 2014 at 11:58 AM,  <josef.pktd at gmail.com> wrote:
>>>>
>>>> No, a view doesn't change the memory, it just changes the
>>>> interpretation and there shouldn't be any conversion involved.
>>>> astype does type conversion, but it goes through ascii encoding which fails.
>>>>
>>>>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>>>>> b.tostring()
>>>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>>>>> b.view('S12')
>>>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>>>>       dtype='|S12')
>>>>
>>>> The conversion happens somewhere in the array creation, but I have no
>>>> idea about the memory encoding for uc2 and the low level layouts.
>>
>>>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>>>> b[0].tostring()
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>>>> 'Õsc'.encode('utf-32LE')
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>
>> Is that the encoding for 'U' ?
>
> On a little-endian system, yes. I realise what's happening now. 'U'
> represents unicode characters as a 32-bit unsigned integer giving the
> code point of the character. The first 256 code points are exactly the
> 256 characters representable with latin-1 in the same order.
>
> So 'Õ' has the code point 0xd5 and is encoded as the byte 0xd5 in
> latin-1. As a 32 bit integer the code point is 0x000000d5 but in
> little-endian format that becomes the 4 bytes 0xd5,0x00,0x00,0x00. So
> when you reinterpret that as 'S4' it strips the remaining nulls to get
> the byte string b'\xd5'. Which is the latin-1 encoding for the
> character. The same will happen for any string of latin-1 characters.
> However, if you do have a code point of 256 or greater then you'll get
> a byte string of length 2 or more.
>
> On a big-endian system I think you'd get b'\x00\x00\x00\xd5'.
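
The byte-level reasoning quoted above can be checked without NumPy. This is
only a rough sketch: the helper name s_view is made up for illustration, and
it mimics the 'S' dtype's null stripping with bytes.rstrip rather than
calling NumPy itself.

```python
def s_view(raw: bytes) -> bytes:
    """Hypothetical helper: mimic how an 'S' item drops trailing nulls."""
    return raw.rstrip(b'\x00')

ch = 'Õ'
code_point = ord(ch)                        # 0xd5, the same value as in latin-1
little = code_point.to_bytes(4, 'little')   # 32-bit integer, low byte first
big = code_point.to_bytes(4, 'big')         # 32-bit integer, high byte first

# The integer layouts match Python's UTF-32 codecs.
assert little == ch.encode('utf-32-le')
assert big == ch.encode('utf-32-be')

latin1_match = s_view(little) == ch.encode('latin-1')
big_endian_item = s_view(big)               # leading nulls are kept
wide_item = s_view('Ā'.encode('utf-32-le')) # U+0100, outside latin-1

print(latin1_match, big_endian_item, wide_item)
```

Only trailing nulls are stripped, which is why the big-endian layout keeps
its three leading zero bytes.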

A curious consequence of this, if we have only 1-character elements:

>>> a = np.array([si.encode('utf-16LE') for si in ['Õ', 'z']], dtype='S')
>>> a32 = np.array([si.encode('utf-32LE') for si in ['Õ', 'z']], dtype='S')
>>> a[0], a32[0]
(b'\xd5', b'\xd5')
>>> a[0] == a32[0]
True

>>> a32 = np.array([si.encode('utf-32BE') for si in ['Õ', 'z']], dtype='S')
>>> a = np.array([si.encode('utf-16BE') for si in ['Õ', 'z']], dtype='S')
>>> a[0], a32[0]
(b'\x00\xd5', b'\x00\x00\x00\xd5')
>>> a[0] == a32[0]
False
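
A small sketch of how far the coincidence goes, assuming the same 'S' dtype
behaviour as in the session above (the helper name truncated is made up for
illustration). With more than one character per element the interior nulls
differ, so the little-endian match should no longer hold.

```python
import numpy as np

def truncated(text, encoding):
    # Store the encoded bytes in an 'S' array and read the item back;
    # indexing drops trailing null bytes.
    return bytes(np.array([text.encode(encoding)], dtype='S')[0])

same_single = truncated('Õ', 'utf-16-le') == truncated('Õ', 'utf-32-le')
same_multi = truncated('Õs', 'utf-16-le') == truncated('Õs', 'utf-32-le')
same_be = truncated('Õ', 'utf-16-be') == truncated('Õ', 'utf-32-be')

print(same_single, same_multi, same_be)
```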

Josef



>
>> another side effect of null truncation: cannot decode truncated data
>>
>>>>> b.view('S4').tostring()
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>>>> b.view('S4')[0]
>> b'\xd5'
>>>>> b.view('S4')[0].tostring()
>> b'\xd5'
>>>>> b.view('S4')[:1].tostring()
>> b'\xd5\x00\x00\x00'
>>
>>>>> b.view('S4')[0].decode('utf-32LE')
>> Traceback (most recent call last):
>>   File "<pyshell#101>", line 1, in <module>
>>     b.view('S4')[0].decode('utf-32LE')
>>   File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
>>     return codecs.utf_32_le_decode(input, errors, True)
>> UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position
>> 0: truncated data
>>
>>>>> b.view('S4')[:1].tostring().decode('utf-32LE')
>> 'Õ'
>>
>> numpy arrays need a decode and encode method
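
One possible workaround for the truncated-data error in the session above is
to pad the item back to a multiple of 4 bytes before decoding. This is a
sketch only, and decode_truncated_utf32le is a hypothetical helper, not a
NumPy function.

```python
def decode_truncated_utf32le(raw: bytes) -> str:
    """Re-append the null bytes stripped from the end, then decode."""
    pad = -len(raw) % 4          # bytes missing from the last code unit
    return (raw + b'\x00' * pad).decode('utf-32-le')

single = decode_truncated_utf32le(b'\xd5')
whole = decode_truncated_utf32le(b'\xd5\x00\x00\x00s\x00\x00\x00c')
print(single, whole)
```

This only restores the nulls belonging to the last code unit. Any genuine
trailing U+0000 characters cannot be told apart from padding and stay lost.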
>
> I'm not sure that they do. Rather there needs to be a text dtype that
> knows what encoding to use in order to have a binary interface as
> exposed by .tostring() and friends but produce unicode strings
> when indexed from Python code. Having both a text and a binary
> interface to the same data implies having an encoding.
>
>
> Oscar
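
For converting between 'U' and 'S' arrays with an explicit encoding,
np.char.encode and np.char.decode already apply a codec element-wise. The
sketch below is not the text dtype described above, just an illustration
that the binary form depends on the encoding chosen.

```python
import numpy as np

text = np.array(['Õsc', 'zxc'])

# Same text, two binary representations.
as_latin1 = np.char.encode(text, 'latin-1')   # one byte per character here
as_utf8 = np.char.encode(text, 'utf-8')       # 'Õ' needs two bytes

# Decoding needs the matching encoding to get the text back.
round_trip = np.char.decode(as_utf8, 'utf-8')

print(as_latin1.dtype, as_utf8.dtype, round_trip)
```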