[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Oscar Benjamin oscar.j.benjamin at gmail.com
Thu Jan 23 11:43:09 EST 2014


On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.pktd at gmail.com wrote:
> 
> another curious example, encode utf-8 to latin-1 bytes
> 
> >>> b
> array(['Õsc', 'zxc'],
>       dtype='<U3')
> >>> b[0].encode('utf8')
> b'\xc3\x95sc'
> >>> b[0].encode('latin1')
> b'\xd5sc'
> >>> b.astype('S')
> Traceback (most recent call last):
>   File "<pyshell#40>", line 1, in <module>
>     b.astype('S')
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
> >>> c = b.view('S4').astype('S1').view('S3')
> >>> c
> array([b'\xd5sc', b'zxc'],
>       dtype='|S3')
> >>> c[0].decode('latin1')
> 'Õsc'

Okay, so it seems that .view() behaves as if it used latin-1 whereas .astype()
uses ascii:

>>> np.array(['Õsc']).astype('S4')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>> np.array(['Õsc']).view('S4')
array([b'\xd5', b's', b'c'], 
      dtype='|S4')
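
A short sketch of why .view() only looks like latin-1: as far as I can tell, no
codec is involved at all. NumPy stores each character of a 'U' array as a 4-byte
UCS-4 code unit, and .view('S4') just reinterprets those bytes. On a
little-endian layout any code point below 256 therefore puts its latin-1 byte
first, followed by nulls that the 'S' dtype drops. The variable names below are
illustrative, and the explicit '<U3' is an assumption made to pin the byte order.

```python
import numpy as np

# Pin little-endian UCS-4 storage so the byte layout is the same everywhere.
a = np.array(['Õsc'], dtype='<U3')

# The raw buffer is simply the UTF-32-LE encoding of the string.
raw = a.tobytes()
print(raw)  # b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'

# Reinterpret the 12 bytes as three 4-byte strings; trailing nulls are
# dropped when the items are displayed or converted.
chunks = a.view('S4')
print(chunks.tolist())  # [b'\xd5', b's', b'c']

# A code point above 255 shows that no latin-1 encoding takes place:
# U+20AC is stored as the bytes AC 20 00 00.
euro_raw = np.array(['€'], dtype='<U1').tobytes()
print(euro_raw)  # b'\xac \x00\x00'
```

So the .view('S4').astype('S1') trick in the quoted example keeps only the low
byte of each code point, which coincides with latin-1 only for characters below
U+0100.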

> --------
> The original numpy py3 conversion used latin-1 as default
> (It's still used in statsmodels, and I haven't looked at the structure
> under the common py2-3 codebase)
> 
> if sys.version_info[0] >= 3:
>     import io
>     bytes = bytes
>     unicode = str
>     asunicode = str

These two functions are an abomination:

>     def asbytes(s):
>         if isinstance(s, bytes):
>             return s
>         return s.encode('latin1')
>     def asstr(s):
>         if isinstance(s, str):
>             return s
>         return s.decode('latin1')
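
One plausible reading of the objection, not spelled out in the thread, is that
these helpers silently assume latin-1 for every conversion. The sketch below
copies the two functions from the quote and runs them on illustrative sample
strings to show what that assumption does in practice.

```python
def asbytes(s):
    if isinstance(s, bytes):
        return s
    return s.encode('latin1')

def asstr(s):
    if isinstance(s, str):
        return s
    return s.decode('latin1')

# Text is silently encoded as latin-1, which differs from the UTF-8 bytes.
latin = asbytes('Õsc')
utf8 = 'Õsc'.encode('utf-8')
print(latin, utf8)  # b'\xd5sc' b'\xc3\x95sc'

# UTF-8 bytes decode without error, but into the wrong text (mojibake).
garbled = asstr(utf8)
print(garbled == 'Õsc')  # False

# Characters outside latin-1 fail only at the point of conversion.
try:
    asbytes('€')
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # True
```

Because every byte value is valid latin-1, asstr() never raises on bytes input,
so bytes in any other encoding pass through undetected.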


Oscar


