[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Julian Taylor jtaylor.debian at googlemail.com
Fri Jan 17 09:12:32 EST 2014


On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:

> On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
> > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
> > <oscar.j.benjamin at gmail.com> wrote:
> >
> > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
> > > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
> > > > [clip]
> > >
> >
> > > > > For backward compatibility we *cannot* change S.
> > >
> > > Do you mean to say that loadtxt cannot be changed from decoding using
> > > system default, splitting on newlines and whitespace and then encoding
> > > the substrings as latin-1?
> > >
> >
> > unicode dtypes have nothing to do with the loadtxt issue. They are not
> > related.
>
> I'm talking about what loadtxt does with the 'S' dtype. As I showed earlier,
> if the file is not encoded as ascii or latin-1 then the byte strings are
> corrupted (see below).
>
> This is because loadtxt opens the file with the default system encoding (by
> not explicitly specifying an encoding):
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
>
> It then processes each line with asbytes() which encodes them as latin-1:
> https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>


Wow, this is just horrible. It might be the source of the bug.
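
To make the quoted description concrete, here is a minimal sketch using only
the Python 3 standard library. It is not the npyio.py code; the helper name
roundtrip is made up for illustration, and it assumes the system default
encoding is utf-8, as on Oscar's machine:

```python
def roundtrip(raw, file_encoding="utf-8"):
    """Decode raw file bytes to text, then re-encode the text as latin-1.

    Hypothetical helper that mimics the two steps described above:
    open() decoding with the system default, then a latin-1 encode.
    """
    text = raw.decode(file_encoding)
    return text.encode("latin-1")


# Bytes as they sit in a utf-8 encoded file.
original = "caf\u00e9".encode("utf-8")

# Bytes that would end up in an 'S' array after the round trip.
stored = roundtrip(original)

print(original)
print(stored)

# A character with no latin-1 representation cannot be stored at all.
try:
    roundtrip("\u20ac".encode("utf-8"))
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False
print(euro_ok)
```

The stored bytes differ from the bytes in the file, and text outside latin-1
raises an error instead of loading.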



>
> Being an English speaker I don't normally use non-ascii characters in
> filenames but my system (Ubuntu Linux) still uses utf-8 rather than
> latin-1 or
> (and rightly so!).
>
> > >
> > > An obvious improvement would be along the lines of what Chris Barker
> > > suggested: decode as latin-1, do the processing and then reencode as
> > > latin-1.
> > >
> >
> > No, the right solution is to add an encoding argument.
> > It's a 4-line patch for python2 and a 2-line patch for python3 and the
> > issue is solved. I'll file a PR later.
>
> What is the encoding argument for? Is it to be used to decode, process the
> text and then re-encode it for an array with dtype='S'?
>

It is only used to decode the file into text, nothing more.
loadtxt is supposed to load text files; it should never have to deal with
bytes.
But I haven't looked into the function deeply yet, there might be ugly
surprises.

The output of the array is determined by the dtype argument and not by the
encoding argument.
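
A rough sketch of that separation, using only the standard library. This is
not the planned patch; read_fields and convert are hypothetical names, and
convert is only a stand-in for what the dtype argument would decide:

```python
import io
import os
import tempfile


def read_fields(path, encoding):
    """Decode the file into text and split each line on whitespace.

    Hypothetical helper: `encoding` is used only to turn bytes into text.
    """
    with io.open(path, "r", encoding=encoding) as f:
        return [line.split() for line in f if line.strip()]


def convert(rows, kinds):
    """Stand-in for the dtype step: one converter per column."""
    return [tuple(kind(field) for kind, field in zip(kinds, row))
            for row in rows]


# Write a small utf-8 file with one text column and one numeric column.
fd, path = tempfile.mkstemp(suffix=".txt")
with io.open(fd, "w", encoding="utf-8") as f:
    f.write("caf\u00e9 1.5\nna\u00efve 2.5\n")

rows = read_fields(path, encoding="utf-8")   # always text, never bytes
records = convert(rows, (str, float))        # output type chosen separately
os.remove(path)

print(rows)
print(records)
```

The encoding never influences the output types; it only has to match how the
file was written.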

Let's please let the loadtxt issue rest.
We know the issue, and we know it can be fixed without adding anything
complicated to numpy.
We just have to use what Python already provides us.
The technical details of the fix can be discussed in the github issue.
(I plan to have a look this weekend, but if someone else wants to do it, let
me know.)