[Numpy-discussion] using loadtxt to load a text file in to a numpy array

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Jan 23 11:58:38 EST 2014


On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:
> On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.pktd at gmail.com wrote:
>>
>> another curious example, encode utf-8 to latin-1 bytes
>>
>> >>> b
>> array(['Õsc', 'zxc'],
>>       dtype='<U3')
>> >>> b[0].encode('utf8')
>> b'\xc3\x95sc'
>> >>> b[0].encode('latin1')
>> b'\xd5sc'
>> >>> b.astype('S')
>> Traceback (most recent call last):
>>   File "<pyshell#40>", line 1, in <module>
>>     b.astype('S')
>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
>> position 0: ordinal not in range(128)
>> >>> c = b.view('S4').astype('S1').view('S3')
>> >>> c
>> array([b'\xd5sc', b'zxc'],
>>       dtype='|S3')
>> >>> c[0].decode('latin1')
>> 'Õsc'
>
> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses
> ascii:
>
> >>> np.array(['Õsc']).astype('S4')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
> >>> np.array(['Õsc']).view('S4')
> array([b'\xd5', b's', b'c'],
>       dtype='|S4')


No, a view doesn't change the memory; it only changes how that memory
is interpreted, so there shouldn't be any conversion involved.
astype does do a type conversion, but it goes through the ascii codec,
which fails here.

>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>> b.tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>> b.view('S12')
array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
      dtype='|S12')
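
A minimal sketch of the same contrast, for anyone who wants to check it
themselves. It assumes the little-endian '<U3' array from above. The
np.char.encode call is not part of the original example; it is only one
way to pick the codec explicitly instead of relying on the ascii default.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')

# view(): same buffer, new interpretation; no bytes are copied or converted.
v = b.view('S12')
same_buffer = np.shares_memory(b, v)

# astype(): a real conversion. NumPy encodes each element with the ascii
# codec, so a non-ASCII character is expected to raise here.
try:
    b.astype('S')
    astype_error = None
except UnicodeEncodeError as exc:
    astype_error = exc

# Explicit alternative (an assumption, not from the thread): choose the codec.
encoded = np.char.encode(b, 'latin1')
```

Here same_buffer should be True, and encoded should hold the same bytes
as the array c in the quoted example.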

The conversion happens somewhere in the array creation, but I have no
idea about the memory encoding for ucs2 and the low-level layouts.
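
A rough sketch of what seems to be going on, again assuming the
little-endian '<U3' array from above. The buffer compares equal to the
UTF-32-LE encoding of the characters (4 bytes each), and 'S' items drop
trailing NUL bytes when read, which would explain why view('S4') looks
like latin-1 for code points below 256. The euro sign is an extra test
character that is not in the original example.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')

# Each character occupies 4 bytes; the buffer matches UTF-32 little-endian.
raw = b.tobytes()
matches_utf32 = raw == ''.join(b).encode('utf-32-le')

# view('S4') exposes one 4-byte cell per character. 'S' items drop trailing
# NUL bytes when read, so code points below 256 show up as a single byte.
cells = b.view('S4')

# astype('S1') keeps the first byte of each cell; view('S3') regroups them.
c = cells.astype('S1').view('S3')

# A code point above 255 does not reduce to one byte: U+20AC is AC 20 00 00.
euro_cell = np.array(['€'], dtype='<U1').view('S4')[0]
```

So the view/astype/view chain only coincides with latin-1 because the
first 256 Unicode code points are the latin-1 characters; for anything
above that, the first byte alone loses information.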

Josef

>
>> --------
>> The original numpy py3 conversion used latin-1 as default
>> (It's still used in statsmodels, and I haven't looked at the structure
>> under the common py2-3 codebase)
>>
>> if sys.version_info[0] >= 3:
>>     import io
>>     bytes = bytes
>>     unicode = str
>>     asunicode = str
>
> These two functions are an abomination:
>
>>     def asbytes(s):
>>         if isinstance(s, bytes):
>>             return s
>>         return s.encode('latin1')
>>     def asstr(s):
>>         if isinstance(s, str):
>>             return s
>>         return s.decode('latin1')
>
>
> Oscar


