[Numpy-discussion] using loadtxt to load a text file in to a numpy array

josef.pktd at gmail.com josef.pktd at gmail.com
Fri Jan 17 14:58:21 EST 2014


On Fri, Jan 17, 2014 at 2:18 PM, Julian Taylor
<jtaylor.debian at googlemail.com> wrote:
> On 17.01.2014 15:12, Julian Taylor wrote:
>> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
>> <oscar.j.benjamin at gmail.com> wrote:
>>
>>     On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
>>     > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
>>     > <oscar.j.benjamin at gmail.com> wrote:
>>     >
>>     > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
>>     > > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
>>     > > > [clip]
>>     > >
>>     >
>>     > > > > For backward compatibility we *cannot* change S.
>>     > >
>>     > > Do you mean to say that loadtxt cannot be changed from decoding using
>>     > > system default, splitting on newlines and whitespace and then
>>     > > encoding the substrings as latin-1?
>>     > >
>>     >
>>     > unicode dtypes have nothing to do with the loadtxt issue. They are not
>>     > related.
>>
>>     I'm talking about what loadtxt does with the 'S' dtype. As I showed
>>     earlier, if the file is not encoded as ascii or latin-1 then the byte
>>     strings are corrupted (see below).
>>
>>     This is because loadtxt opens the file with the default system encoding
>>     (by not explicitly specifying an encoding):
>>     https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
>>
>>     It then processes each line with asbytes() which encodes them as latin-1:
>>     https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
>>     https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
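
A minimal sketch of the round trip described above, for Python 3. It is not the loadtxt code itself: the helper name `roundtrip` is made up for illustration, and the decoding step takes an explicit encoding where loadtxt would pick up the system default.

```python
def roundtrip(raw, decode_with):
    """Decode raw file bytes to text, then re-encode as latin-1,
    mimicking the decode-then-asbytes() sequence described above."""
    text = raw.decode(decode_with)
    return text.encode('latin-1')

# Bytes as they would be stored in a UTF-8 file ('\u00d5' is the letter Õ).
raw = '\u00d5scar'.encode('utf-8')

# If the file is decoded as latin-1, the round trip is lossless.
same = roundtrip(raw, 'latin-1')

# If the file is decoded as UTF-8, Õ is re-encoded as a single latin-1
# byte, so the result no longer matches the bytes in the file.
changed = roundtrip(raw, 'utf-8')

# Characters outside latin-1 (here the euro sign) cannot be re-encoded.
try:
    roundtrip('\u20ac'.encode('utf-8'), 'utf-8')
    failed = False
except UnicodeEncodeError:
    failed = True
```

Only the case where the decoding step also uses latin-1 leaves the bytes untouched, which is why a UTF-8 system default leads to corrupted 'S' arrays.
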
>>
>>
>>
>> wow this is just horrible, it might be the source of the bug.
>>
>>
>>
>>
>>     Being an English speaker I don't normally use non-ascii characters in
>>     filenames but my system (Ubuntu Linux) still uses utf-8 rather than
>>     latin-1 (and rightly so!).
>>
>>     > >
>>     > > An obvious improvement would be along the lines of what Chris Barker
>>     > > suggested: decode as latin-1, do the processing and then reencode as
>>     > > latin-1.
>>     > >
>>     >
>>     > no, the right solution is to add an encoding argument.
>>     > It's a 4 line patch for python2 and a 2 line patch for python3 and the
>>     > issue is solved, I'll file a PR later.
>>
>>     What is the encoding argument for? Is it to be used to decode, process
>>     the text and then re-encode it for an array with dtype='S'?
>>
>>
>> it is only used to decode the file into text, nothing more.
>> loadtxt is supposed to load text files, it should never have to deal
>> with bytes ever.
>> But I haven't looked into the function deeply yet, there might be ugly
>> surprises.
>>
>> The output of the array is determined by the dtype argument and not by
>> the encoding argument.
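
A rough sketch of that separation, using only the standard library. The function `read_fields` and its `as_bytes` flag are hypothetical: `as_bytes` stands in for the dtype choice ('S' versus a text dtype) and is not part of any NumPy API. The encoding is used once, to turn the file into text; the output representation is chosen independently afterwards.

```python
import io
import os
import tempfile

def read_fields(path, encoding, delimiter=',', as_bytes=False):
    """Decode `path` with an explicit encoding and split lines into fields.

    `as_bytes` only changes the returned representation; it has no
    influence on how the file is decoded.
    """
    rows = []
    with io.open(path, 'r', encoding=encoding) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            fields = line.split(delimiter)
            if as_bytes:
                fields = [f.encode(encoding) for f in fields]
            rows.append(fields)
    return rows

# Write a small UTF-8 test file, then read it back both ways.
fd, path = tempfile.mkstemp(suffix='.txt')
with io.open(fd, 'w', encoding='utf-8') as fh:
    fh.write('1,2,3,hello\n5,6,7,\u00d5scar\n')
try:
    text_rows = read_fields(path, 'utf-8')
    byte_rows = read_fields(path, 'utf-8', as_bytes=True)
finally:
    os.remove(path)
```

All splitting happens on text, so no bytes are handled until the caller explicitly asks for a byte representation.
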
>>
>> Let's please lay the loadtxt issue to rest.
>> We know the issue, we know it can be fixed without adding anything
>> complicated to numpy.
>> We just have to use what python already provides us.
>> The technical details of the fix can be discussed in the github issue.
>> (Plan to have a look this weekend, but if someone else wants to do it
>> let me know).
>>
>
> Work in progress PR:
> https://github.com/numpy/numpy/pull/4208
>
> I also seem to have fixed the original bug, which wasn't even my
> intention with that PR :)
> Apparently it was indeed one of the broken asbytes calls.
>
> If you have applications using loadtxt please give it a try, but
> genfromtxt is still completely broken (and a much larger fix, asbytes
> everywhere).

Does this still work?

>>> numpy.loadtxt(open('Õscar_3.txt',"rb"), 'S')
array([b'1,2,3,hello', b'5,6,7,\xc3\x95scarscar', b'15,2,3,hello',
       b'20,2,3,\xc3\x95scar'],
      dtype='|S16')
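
As a check on the output above (a sketch that was not part of the original session): the byte strings look like intact UTF-8, so they should decode cleanly. The assumption here is that the file was saved as UTF-8. Since no delimiter was passed, each array element holds a whole line, and the longest one is 16 bytes, which agrees with the '|S16' dtype.

```python
# Byte strings copied from the 'S' array shown above.
lines = [b'1,2,3,hello', b'5,6,7,\xc3\x95scarscar',
         b'15,2,3,hello', b'20,2,3,\xc3\x95scar']

# Assumption: the file was encoded as UTF-8.
decoded = [line.decode('utf-8').split(',') for line in lines]

# Last field of each row, now as text.
names = [row[3] for row in decoded]

# Length in bytes of the longest line.
longest = max(len(line) for line in lines)
```
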

For comparison:

>>> numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'), delimiter=',')
Traceback (most recent call last):
  File "<pyshell#251>", line 1, in <module>
    numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'), delimiter=',')
  File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", line 1828, in recfromtxt
    output = genfromtxt(fname, **kwargs)
  File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py", line 1351, in genfromtxt
    first_values = split_line(first_line)
  File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py", line 207, in _delimited_splitter
    line = line.split(self.comments)[0]
TypeError: Can't convert 'bytes' object to str implicitly

>>> numpy.recfromtxt(open('Õscar_3.txt',"rb"), delimiter=',')
rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'),
       (15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10')])

Josef
