[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Oscar Benjamin oscar.j.benjamin at gmail.com
Fri Jan 17 08:40:34 EST 2014


On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:
> 
> > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> > > [clip]
> >
> 
> > > > For backward compatibility we *cannot* change S.
> >
> > Do you mean to say that loadtxt cannot be changed from decoding using system
> > default, splitting on newlines and whitespace and then encoding the substrings
> > as latin-1?
> >
> 
> unicode dtypes have nothing to do with the loadtxt issue. They are not
> related.

I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier,
if the file is not encoded as ascii or latin-1 then the byte strings are
corrupted (see below).

This is because loadtxt opens the file with the default system encoding (by
not explicitly specifying an encoding):
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732

It then processes each line with asbytes() which encodes them as latin-1:
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
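
To make those two steps concrete, here is a minimal sketch that mimics them
without numpy. It assumes utf-8 as the system default for the read step (as on
my system); the variable names are only for illustration and do not appear in
npyio.py.

```python
# Bytes as they sit on disk in a utf-8 encoded file.
raw = 'Õscar.txt'.encode('utf-8')

# Step 1: the file is opened in text mode, so the bytes are decoded.
# utf-8 is assumed here as the system default encoding.
text = raw.decode('utf-8')

# Step 2: the text is encoded as latin-1, as asbytes() does.
stored = text.encode('latin-1')

print(raw)     # b'\xc3\x95scar.txt'
print(stored)  # b'\xd5scar.txt'
```

The stored bytes no longer match the bytes in the file, which is the
corruption I am describing.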

Being an English speaker I don't normally use non-ascii characters in
filenames, but my system (Ubuntu Linux) still uses utf-8 rather than latin-1
(and rightly so!).

> >
> > An obvious improvement would be along the lines of what Chris Barker
> > suggested: decode as latin-1, do the processing and then reencode as
> > latin-1.
> >
> 
> no, the right solution is to add an encoding argument.
> It's a 4 line patch for python2 and a 2 line patch for python3 and the issue
> is solved, I'll file a PR later.

What is the encoding argument for? Is it to be used to decode, process the
text and then re-encode it for an array with dtype='S'?

Note that there are two encodings: one for reading from the file and one for
storing in the array. The former describes the content of the file and the
latter will be used if I extract a byte-string from the array and pass it to
any Python API.
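
A rough sketch of what keeping the two encodings separate could look like.
read_fields and its parameters file_encoding and array_encoding are
hypothetical names made up for this example; this is not numpy API and not
necessarily what the proposed patch would do.

```python
import os
import tempfile


def read_fields(path, file_encoding, array_encoding):
    # Hypothetical helper for illustration only (not part of numpy).
    # file_encoding describes the content of the file on disk.
    with open(path, encoding=file_encoding) as fin:
        fields = fin.read().split()
    # array_encoding decides which bytes end up in an 'S' array.
    return [field.encode(array_encoding) for field in fields]


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'filenames.txt')
    with open(path, 'w', encoding='utf-8') as fout:
        fout.write('filenames.txt\nÕscar.txt\n')
    same = read_fields(path, 'utf-8', 'utf-8')
    mixed = read_fields(path, 'utf-8', 'latin-1')

print(same)   # [b'filenames.txt', b'\xc3\x95scar.txt']
print(mixed)  # [b'filenames.txt', b'\xd5scar.txt']
```

Only the first result contains byte strings that could be handed back to a
utf-8 filesystem unchanged.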

> No latin1 de/encoding is required for anything, I don't know why you would
> want to do that in this context.
> Does opening latin1 files even work with current loadtxt?

It's the only encoding that works for dtype='S'.
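
The reason is a property of the codec itself: latin-1 maps each of the 256
byte values to exactly one code point, so decoding with latin-1 and encoding
with latin-1 again returns the original bytes. A quick check of that property,
independent of loadtxt:

```python
# Every possible byte value, 0 through 255.
every_byte = bytes(range(256))

# latin-1 round-trips arbitrary bytes without loss.
round_tripped = every_byte.decode('latin-1').encode('latin-1')
print(round_tripped == every_byte)  # True

# utf-8 cannot even decode arbitrary bytes.
try:
    every_byte.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```

So the latin-1 encode step is only harmless when the read step also used
latin-1 (or the data is pure ascii).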

> It currently uses UTF-8 which is to my knowledge not compatible with latin1.

It uses utf-8 (on my system) to read and latin-1 (on any system) to encode and
store in the array, corrupting any non-ascii characters. Here's a
demonstration:

$ ipython3
Python 3.2.3 (default, Sep 25 2013, 18:22:43) 
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: with open('Õscar.txt', 'w') as fout: pass

In [2]: import os

In [3]: os.listdir('.')
Out[3]: ['Õscar.txt']

In [4]: with open('filenames.txt', 'w') as fout:
   ...:     fout.writelines([f + '\n' for f in os.listdir('.')])
   ...:     

In [5]: with open('filenames.txt') as fin:
   ...:     print(fin.read())
   ...:     
filenames.txt
Õscar.txt


In [6]: import numpy

In [7]: filenames = numpy.loadtxt('filenames.txt')
<snip>
ValueError: could not convert string to float: b'filenames.txt'

In [8]: filenames = numpy.loadtxt('filenames.txt', dtype='S')

In [9]: filenames
Out[9]: 
array([b'filenames.txt', b'\xd5scar.txt'], 
      dtype='|S13')

In [10]: open(filenames[1])
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
/users/enojb/.rcs/tmp/<ipython-input-10-3bf2418688a2> in <module>()
----> 1 open(filenames[1])

IOError: [Errno 2] No such file or directory: '\udcd5scar.txt'

In [11]: open('Õscar.txt'.encode('utf-8'))
Out[11]: <_io.TextIOWrapper name=b'\xc3\x95scar.txt' mode='r' encoding='UTF-8'>
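
The '\udcd5' in the error message is consistent with Python 3 decoding byte
filenames using the filesystem encoding together with the surrogateescape
error handler. A small sketch, assuming the filesystem encoding is utf-8 as on
my system:

```python
# The byte string that ended up in the array.
stored = b'\xd5scar.txt'

# 0xd5 is not valid utf-8 here, so surrogateescape maps it to the
# lone surrogate U+DCD5 instead of raising an error.
name = stored.decode('utf-8', 'surrogateescape')
print(ascii(name))  # '\udcd5scar.txt'

# Encoding with the same handler restores the original bytes.
print(name.encode('utf-8', 'surrogateescape') == stored)  # True
```

The name that reaches the filesystem is therefore still the latin-1 bytes,
which do not match the utf-8 encoded name of the real file.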


Oscar


