[Numpy-discussion] using loadtxt to load a text file in to a numpy array
Oscar Benjamin
oscar.j.benjamin at gmail.com
Fri Jan 17 08:40:34 EST 2014
On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:
>
> > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> > > [clip]
> >
>
> > > > For backward compatibility we *cannot* change S.
> >
> > Do you mean to say that loadtxt cannot be changed from decoding using system
> > default, splitting on newlines and whitespace and then encoding the substrings
> > as latin-1?
> >
>
> unicode dtypes have nothing to do with the loadtxt issue. They are not
> related.
I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier,
if the file is not encoded as ascii or latin-1 then the byte strings are
corrupted (see below).
This is because loadtxt opens the file with the default system encoding (by
not explicitly specifying an encoding):
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
It then processes each line with asbytes() which encodes them as latin-1:
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
Being an English speaker I don't normally use non-ascii characters in
filenames but my system (Ubuntu Linux) still uses utf-8 rather than latin-1
(and rightly so!).
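To make the two steps concrete, here is a minimal sketch in plain Python (no numpy) of what happens to the bytes. It assumes the system default encoding is utf-8, as on my machine; the variable names are just for illustration and this is not the actual loadtxt code.

```python
# The bytes on disk for a utf-8 encoded line containing 'Õscar.txt'
# ('\u00d5' is the character Õ).
raw = '\u00d5scar.txt'.encode('utf-8')

# Step 1: the file is decoded with the system default (assumed utf-8 here).
text = raw.decode('utf-8')

# Step 2: the decoded text is re-encoded as latin-1 before being stored.
stored = text.encode('latin-1')

print(raw)            # the bytes that were in the file
print(stored)         # the bytes that end up in the array
print(raw == stored)  # the two differ for any non-ascii character
```

The stored bytes are no longer valid utf-8, so they cannot be handed back to anything that expects the original encoding.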
> >
> > An obvious improvement would be along the lines of what Chris Barker
> > suggested: decode as latin-1, do the processing and then reencode as
> > latin-1.
> >
>
> no, the right solution is to add an encoding argument.
> It's a 4 line patch for python2 and a 2 line patch for python3 and the issue
> is solved, I'll file a PR later.
What is the encoding argument for? Is it to be used to decode, process the
text and then re-encode it for an array with dtype='S'?
Note that there are two encodings: one for reading from the file and one for
storing in the array. The former describes the content of the file and the
latter will be used if I extract a byte-string from the array and pass it to
any Python API.
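As a rough sketch of what I mean by two encodings, consider a reader that takes both explicitly. The function read_tokens and its parameters file_encoding and array_encoding are hypothetical names for illustration only; they are not part of numpy and not necessarily what the proposed patch would do.

```python
import io

def read_tokens(stream, file_encoding, array_encoding):
    """Hypothetical reader: decode the file, split it, re-encode each token."""
    text = stream.read().decode(file_encoding)       # encoding of the file
    return [token.encode(array_encoding)             # encoding of stored bytes
            for token in text.split()]

# A utf-8 encoded file held in memory ('\u00d5' is the character Õ).
content = 'filenames.txt\n\u00d5scar.txt\n'.encode('utf-8')

same = read_tokens(io.BytesIO(content), 'utf-8', 'utf-8')
mixed = read_tokens(io.BytesIO(content), 'utf-8', 'latin-1')

print(same)   # stored bytes match what was in the file
print(mixed)  # stored bytes differ from what was in the file
```

Only when the two encodings agree (or the round trip is otherwise lossless) do the stored byte strings match the file content.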
> No latin1 de/encoding is required for anything, I don't know why you would
> want to do that in this context.
> Does opening latin1 files even work with current loadtxt?
It's the only encoding that works for dtype='S'.
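The reason latin-1 is special is that it maps each byte value to the code point with the same number, so decoding as latin-1 and re-encoding as latin-1 returns the original bytes whatever the file really contains. Below is a minimal sketch of the decode/process/re-encode idea quoted above. The helper split_fields is a hypothetical name, and splitting on ascii whitespace only is my own assumption about what the processing step should do.

```python
import re

def split_fields(line_bytes):
    """Hypothetical helper: decode as latin-1, split, re-encode as latin-1."""
    text = line_bytes.decode('latin-1')
    # Split on ascii whitespace only. A bare str.split() would also split on
    # U+0085 and U+00A0, and those byte values can occur inside utf-8
    # sequences once the bytes have been decoded as latin-1.
    fields = re.split(r'[ \t\r\n]+', text.strip(' \t\r\n'))
    return [field.encode('latin-1') for field in fields]

# The round trip is lossless for every possible byte value.
every_byte = bytes(range(256))
print(every_byte.decode('latin-1').encode('latin-1') == every_byte)

# A utf-8 encoded line passes through unchanged ('\u00d5' is Õ).
line = 'filenames.txt \u00d5scar.txt\n'.encode('utf-8')
fields = split_fields(line)
print(fields)
```

With this approach the byte strings in the array are exactly the bytes from the file, so the caller can decode them with whatever encoding the file actually used.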
> It currently uses UTF-8 which is to my knowledge not compatible with latin1.
It uses utf-8 (on my system) to read and latin-1 (on any system) to encode and
store in the array, corrupting any non-ascii characters. Here's a
demonstration:
$ ipython3
Python 3.2.3 (default, Sep 25 2013, 18:22:43)
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: with open('Õscar.txt', 'w') as fout: pass

In [2]: import os

In [3]: os.listdir('.')
Out[3]: ['Õscar.txt']

In [4]: with open('filenames.txt', 'w') as fout:
   ...:     fout.writelines([f + '\n' for f in os.listdir('.')])
   ...:

In [5]: with open('filenames.txt') as fin:
   ...:     print(fin.read())
   ...:
filenames.txt
Õscar.txt

In [6]: import numpy

In [7]: filenames = numpy.loadtxt('filenames.txt')
<snip>
ValueError: could not convert string to float: b'filenames.txt'

In [8]: filenames = numpy.loadtxt('filenames.txt', dtype='S')

In [9]: filenames
Out[9]:
array([b'filenames.txt', b'\xd5scar.txt'],
      dtype='|S13')

In [10]: open(filenames[1])
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
/users/enojb/.rcs/tmp/<ipython-input-10-3bf2418688a2> in <module>()
----> 1 open(filenames[1])

IOError: [Errno 2] No such file or directory: '\udcd5scar.txt'

In [11]: open('Õscar.txt'.encode('utf-8'))
Out[11]: <_io.TextIOWrapper name=b'\xc3\x95scar.txt' mode='r' encoding='UTF-8'>
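The '\udcd5' in the error message comes from how Python 3 turns byte-string filenames into text on POSIX: it decodes them with the filesystem encoding and the surrogateescape error handler. A short sketch, assuming a utf-8 filesystem encoding as on my system:

```python
# The corrupted byte string that ended up in the array.
stored = b'\xd5scar.txt'

# Byte 0xd5 followed by 's' is not valid utf-8, so surrogateescape maps it
# to the lone surrogate U+DCD5 instead of raising an error.
name = stored.decode('utf-8', 'surrogateescape')
print(ascii(name))

# The escape is reversible: encoding gives back the same (wrong) bytes.
print(name.encode('utf-8', 'surrogateescape') == stored)

# The bytes of the real filename are different ('\u00d5' is Õ).
print('\u00d5scar.txt'.encode('utf-8'))
```

So open() ends up looking for a file whose name is the single byte 0xd5 followed by 'scar.txt', which does not exist.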
Oscar