[Numpy-discussion] using loadtxt to load a text file in to a numpy array
Julian Taylor
jtaylor.debian at googlemail.com
Fri Jan 17 14:18:47 EST 2014
On 17.01.2014 15:12, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:
>
> On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
> > <oscar.j.benjamin at gmail.com> wrote:
> >
> > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> > > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> > > > [clip]
> > >
> >
> > > > > For backward compatibility we *cannot* change S.
> > >
> > > Do you mean to say that loadtxt cannot be changed from decoding using
> > > system default, splitting on newlines and whitespace and then encoding
> > > the substrings as latin-1?
> > >
> >
> > unicode dtypes have nothing to do with the loadtxt issue. They are not
> > related.
>
> I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier,
> if the file is not encoded as ascii or latin-1 then the byte strings are
> corrupted (see below).
>
> This is because loadtxt opens the file with the default system encoding
> (by not explicitly specifying an encoding):
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
>
> It then processes each line with asbytes() which encodes them as latin-1:
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>
>
>
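The round trip described in the quoted text (decode with the system default, then re-encode as latin-1) can be illustrated with a minimal sketch. It assumes the system default encoding is UTF-8 and uses plain `str`/`bytes` methods rather than the actual loadtxt or asbytes() code, so it only mimics the behaviour under discussion:

```python
# Bytes as they would be stored in a UTF-8 encoded text file.
original = "caf\u00e9".encode("utf-8")

# Step 1: the file is opened in text mode, so it is decoded with the
# system default encoding (assumed here to be UTF-8).
text = original.decode("utf-8")

# Step 2: an asbytes()-style conversion re-encodes the text as latin-1.
roundtrip = text.encode("latin-1")

print(original)   # b'caf\xc3\xa9'
print(roundtrip)  # b'caf\xe9'

# Characters that have no latin-1 representation cannot be encoded at all.
try:
    "\u20ac".encode("latin-1")  # the euro sign
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False
```

The bytes that come out are not the bytes that were in the file, and text outside the latin-1 range raises an error instead.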
> Wow, this is just horrible; it might be the source of the bug.
>
>
>
>
> Being an English speaker I don't normally use non-ascii characters in
> filenames but my system (Ubuntu Linux) still uses utf-8 rather than latin-1
> (and rightly so!).
>
> > >
> > > An obvious improvement would be along the lines of what Chris Barker
> > > suggested: decode as latin-1, do the processing and then reencode as
> > > latin-1.
> > >
> >
> > no, the right solution is to add an encoding argument.
> > It's a 4-line patch for python2 and a 2-line patch for python3 and the
> > issue is solved, I'll file a PR later.
>
> What is the encoding argument for? Is it to be used to decode, process the
> text and then re-encode it for an array with dtype='S'?
>
>
> it is only used to decode the file into text, nothing more.
> loadtxt is supposed to load text files, it should never have to deal
> with bytes ever.
> But I haven't looked into the function deeply yet, there might be ugly
> surprises.
>
> The output of the array is determined by the dtype argument and not by
> the encoding argument.
>
> Let's please let the loadtxt issue rest.
> We know the issue, we know it can be fixed without adding anything
> complicated to numpy.
> We just have to use what python already provides us.
> The technical details of the fix can be discussed in the github issue.
> (I plan to have a look this weekend, but if someone else wants to do it,
> let me know.)
>
Work in progress PR:
https://github.com/numpy/numpy/pull/4208
I also seem to have fixed the original bug, which wasn't even my
intention with that PR :)
Apparently it was indeed one of the broken asbytes calls.
If you have applications using loadtxt please give it a try, but
genfromtxt is still completely broken (and needs a much larger fix, as it
uses asbytes everywhere).
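
To illustrate the idea discussed in this thread, that an encoding argument is used only to decode the file into text while the output type is chosen separately, here is a minimal standard-library sketch. The helper `load_text_columns` and its `convert` parameter are hypothetical names made up for this example (with `convert` standing in for the role of dtype); this is not the code in the PR, and it assumes Python 3:

```python
import os
import tempfile


def load_text_columns(fname, encoding, convert):
    """Hypothetical helper: decode with `encoding`, split, then convert."""
    rows = []
    # The encoding is used only here, to turn the file's bytes into text.
    with open(fname, "r", encoding=encoding) as fh:
        for line in fh:
            fields = line.split()  # split on whitespace
            if fields:
                # The output type depends on `convert`, not on the encoding.
                rows.append([convert(field) for field in fields])
    return rows


# Write a small UTF-8 file containing non-ascii text.
fd, path = tempfile.mkstemp(suffix=".txt")
with open(fd, "w", encoding="utf-8") as fh:
    fh.write("caf\u00e9 na\u00efve\n")

# Same file, same encoding, two different output types.
as_text = load_text_columns(path, "utf-8", str)
as_bytes = load_text_columns(path, "utf-8", lambda s: s.encode("utf-8"))

os.remove(path)
```

Here the parsing itself only ever sees text; bytes appear only at the end, and only because the caller asked for them.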