[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Fri Jan 17 09:11:22 EST 2014

On Fri, Jan 17, 2014 at 8:40 AM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:
> On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
>> On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
>> <oscar.j.benjamin at gmail.com> wrote:
>>
>> > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
>> > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
>> > > [clip]
>> > >
>> > > > For backward compatibility we *cannot* change S.
>> >
>> > Do you mean to say that loadtxt cannot be changed from decoding using
>> > system default, splitting on newlines and whitespace and then encoding
>> > the substrings as latin-1?
>> >
>>
>> unicode dtypes have nothing to do with the loadtxt issue. They are not
>> related.
>
> I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier,
> if the file is not encoded as ascii or latin-1 then the byte strings are
> corrupted (see below).
>
> This is because loadtxt opens the file with the default system encoding (by
> not explicitly specifying an encoding):
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
>
> It then processes each line with asbytes() which encodes them as latin-1:
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>
> Being an English speaker I don't normally use non-ascii characters in
> filenames, but my system (Ubuntu Linux) still uses utf-8 rather than latin-1
> (and rightly so!).
>
>> >
>> > An obvious improvement would be along the lines of what Chris Barker
>> > suggested: decode as latin-1, do the processing and then reencode as
>> > latin-1.
>> >
>>
>> No, the right solution is to add an encoding argument.
>> It's a 4-line patch for python2 and a 2-line patch for python3, and the
>> issue is solved. I'll file a PR later.
>
> What is the encoding argument for? Is it to be used to decode, process the
> text and then re-encode it for an array with dtype='S'?
>
> Note that there are two encodings: one for reading from the file and one for
> storing in the array. The former describes the content of the file and the
> latter will be used if I extract a byte-string from the array and pass it to
> any Python API.
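
The two-encodings point can be sketched with the standard library alone. This
is only an illustration, not loadtxt's code: read_tokens_as_bytes and its
parameters file_encoding and array_encoding are hypothetical names made up for
this example.

```python
import os
import tempfile


def read_tokens_as_bytes(path, file_encoding, array_encoding):
    """Hypothetical helper: decode with file_encoding, split on whitespace,
    then re-encode each token with array_encoding."""
    with open(path, encoding=file_encoding) as fin:
        return [token.encode(array_encoding)
                for line in fin
                for token in line.split()]


# Write a small UTF-8 file containing one non-ascii filename (Õscar.txt).
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'filenames.txt')
    with open(path, 'w', encoding='utf-8') as fout:
        fout.write('\u00d5scar.txt\n')

    # Same encoding on both sides: the bytes match what is on disk.
    same_encoding = read_tokens_as_bytes(path, 'utf-8', 'utf-8')
    # Different encodings: the stored bytes no longer match the file.
    mixed_encoding = read_tokens_as_bytes(path, 'utf-8', 'latin-1')

print(same_encoding)   # [b'\xc3\x95scar.txt']
print(mixed_encoding)  # [b'\xd5scar.txt']
```

Only the first result can be handed back to an API that expects the bytes as
they appear on a utf-8 system.
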
>
>> No latin1 de/encoding is required for anything, I don't know why you would
>> want to do that in this context.
>> Does opening latin1 files even work with current loadtxt?
>
> It's the only encoding that works for dtype='S'.
>
>> It currently uses UTF-8 which is to my knowledge not compatible with latin1.
>
> It uses utf-8 (on my system) to read and latin-1 (on any system) to encode and
> store in the array, corrupting any non-ascii characters. Here's a
> demonstration:
>
> $ ipython3
> Python 3.2.3 (default, Sep 25 2013, 18:22:43)
> Type "copyright", "credits" or "license" for more information.
>
> IPython 0.12.1 -- An enhanced Interactive Python.
> ?         -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help      -> Python's own help system.
> object?   -> Details about 'object', use 'object??' for extra details.
>
> In [1]: with open('Õscar.txt', 'w') as fout: pass
>
> In [2]: import os
>
> In [3]: os.listdir('.')
> Out[3]: ['Õscar.txt']
>
> In [4]: with open('filenames.txt', 'w') as fout:
>    ...:     fout.writelines([f + '\n' for f in os.listdir('.')])
>    ...:
>
> In [5]: with open('filenames.txt') as fin:
>    ...:     print(fin.read())
>    ...:
> filenames.txt
> Õscar.txt
>
>
> In [6]: import numpy
>
> In [7]: filenames = numpy.loadtxt('filenames.txt')
> <snip>
> ValueError: could not convert string to float: b'filenames.txt'
>
> In [8]: filenames = numpy.loadtxt('filenames.txt', dtype='S')
>
> In [9]: filenames
> Out[9]:
> array([b'filenames.txt', b'\xd5scar.txt'],
>       dtype='|S13')
>
> In [10]: open(filenames[1])
> ---------------------------------------------------------------------------
> IOError                                   Traceback (most recent call last)
> /users/enojb/.rcs/tmp/<ipython-input-10-3bf2418688a2> in <module>()
> ----> 1 open(filenames[1])
>
> IOError: [Errno 2] No such file or directory: '\udcd5scar.txt'
>
> In [11]: open('Õscar.txt'.encode('utf-8'))
> Out[11]: <_io.TextIOWrapper name=b'\xc3\x95scar.txt' mode='r' encoding='UTF-8'>

Windows seems to use consistent encoding and decoding throughout (example run
in IDLE):

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32

>>> filenames = numpy.loadtxt('filenames.txt', dtype='S')
>>> filenames
array([b'weighted_kde.py', b'_proportion.log.py', b'__init__.py',
       b'\xd5scar.txt'],
      dtype='|S18')
>>> fn = open(filenames[-1])
>>> fn.read()
'1,2,3,hello\n5,6,7,Õscar\n'
>>> fn
<_io.TextIOWrapper name=b'\xd5scar.txt' mode='r' encoding='cp1252'>
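
One possible explanation, which is an assumption and not something verified
against loadtxt itself: cp1252 and latin-1 encode 'Õ' as the same byte, so
the latin-1 step happens to round-trip here. That would not hold for
characters where the two encodings differ. A small sketch (variable names are
illustrative only):

```python
# 'Õ' is byte 0xd5 in both cp1252 and latin-1, so the bytes agree.
name = '\u00d5scar.txt'
cp1252_name = name.encode('cp1252')
latin1_name = name.encode('latin-1')
print(cp1252_name == latin1_name)  # True

# The euro sign exists in cp1252 (byte 0x80) but not in latin-1.
euro_name = '\u20acuro.txt'
cp1252_euro = euro_name.encode('cp1252')
print(cp1252_euro)  # b'\x80uro.txt'

try:
    euro_name.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```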

Josef

>
>
> Oscar