[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Fri Jan 17 14:18:47 EST 2014

On 17.01.2014 15:12, Julian Taylor wrote:
> On Fri, Jan 17, 2014 at 2:40 PM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:
> 
>     On Fri, Jan 17, 2014 at 02:10:19PM +0100, Julian Taylor wrote:
>     > On Fri, Jan 17, 2014 at 1:44 PM, Oscar Benjamin
>     > <oscar.j.benjamin at gmail.com> wrote:
>     >
>     > > On Fri, Jan 17, 2014 at 10:59:27AM +0000, Pauli Virtanen wrote:
>     > > > Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
>     > > > [clip]
>     > >
>     >
>     > > > > For backward compatibility we *cannot* change S.
>     > >
>     > > Do you mean to say that loadtxt cannot be changed from decoding using
>     > > system default, splitting on newlines and whitespace and then encoding
>     > > the substrings as latin-1?
>     > >
>     >
>     > unicode dtypes have nothing to do with the loadtxt issue. They are not
>     > related.
> 
>     I'm talking about what loadtxt does with the 'S' dtype. As I showed
>     earlier, if the file is not encoded as ascii or latin-1 then the byte
>     strings are corrupted (see below).
> 
>     This is because loadtxt opens the file with the default system
>     encoding (by not explicitly specifying an encoding):
>     https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L732
> 
>     It then processes each line with asbytes() which encodes them as latin-1:
>     https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L784
>     https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
> 
> 
> 
> wow this is just horrible, it might be the source of the bug.
> 
>  
> 
> 
>     Being an English speaker I don't normally use non-ascii characters in
>     filenames but my system (Ubuntu Linux) still uses utf-8 rather than
>     latin-1 or
>     (and rightly so!).
> 
>     > >
>     > > An obvious improvement would be along the lines of what Chris Barker
>     > > suggested: decode as latin-1, do the processing and then reencode as
>     > > latin-1.
>     > >
>     >
>     > no, the right solution is to add an encoding argument.
>     > It's a 4 line patch for python2 and a 2 line patch for python3 and the
>     > issue is solved, I'll file a PR later.
> 
>     What is the encoding argument for? Is it to be used to decode, process
>     the text and then re-encode it for an array with dtype='S'?
> 
> 
> it is only used to decode the file into text, nothing more.
> loadtxt is supposed to load text files, it should never have to deal
> with bytes ever.
> But I haven't looked into the function deeply yet, there might be ugly
> surprises.
> 
> The output of the array is determined by the dtype argument and not by
> the encoding argument.
> 
> Let's please let the loadtxt issue go to rest.
> We know the issue, we know it can be fixed without adding anything
> complicated to numpy.
> We just have to use what python already provides us.
> The technical details of the fix can be discussed in the github issue.
> (Plan to have a look this weekend, but if someone else wants to do it
> let me know).
> 

Work in progress PR:
https://github.com/numpy/numpy/pull/4208

I also seem to have fixed the original bug, which wasn't even my
intention with that PR :)
Apparently it was indeed one of the broken asbytes calls.
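
For anyone following along, here is a minimal sketch of the corruption
described in the quoted text. It is not the numpy code: asbytes_like is a
hypothetical stand-in that only mirrors the described behaviour (encode as
latin-1), and the UTF-8 decode step assumes a system whose default encoding
is UTF-8.

```python
def asbytes_like(text):
    # Hypothetical stand-in for the asbytes() behaviour described in the
    # thread: text is encoded as latin-1.
    return text.encode("latin-1")


# Bytes as they would be stored in a UTF-8 encoded file ("caf" + e-acute).
original = "caf\u00e9".encode("utf-8")

# Assumption: the file is opened with a UTF-8 system default encoding.
decoded = original.decode("utf-8")

# Re-encoding as latin-1 does not give back the bytes that were in the file.
roundtrip = asbytes_like(decoded)

print(original)   # b'caf\xc3\xa9'
print(roundtrip)  # b'caf\xe9'

# For comparison: decoding as latin-1 and re-encoding as latin-1 is lossless
# for arbitrary bytes, which is the alternative suggested in the quoted text.
all_bytes = bytes(range(256))
lossless = all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

The two printed values differ, so a dtype='S' field would not hold the bytes
that were actually in the file.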

If you have applications using loadtxt please give it a try, but
genfromtxt is still completely broken (and needs a much larger fix,
asbytes is used everywhere).
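
A rough sketch of the idea from the quoted discussion (the encoding is used
only to decode the file into text, and the requested output type decides what
comes out). This is not the code in the PR: read_fields and to_output are
hypothetical helpers, the single-letter kinds only imitate dtype codes, and
using latin-1 for the byte-string output is an assumption made for the
example.

```python
import io
import os
import tempfile


def read_fields(path, encoding):
    # Hypothetical helper: decode the file as text with an explicit
    # encoding, then split each non-empty line on whitespace.
    with io.open(path, "r", encoding=encoding) as fh:
        return [line.split() for line in fh if line.strip()]


def to_output(field, kind):
    # Hypothetical stand-in for dtype handling: the requested kind, not the
    # file encoding, decides the type of the result.
    if kind == "U":
        return field                    # keep as text
    if kind == "S":
        return field.encode("latin-1")  # assumption: latin-1 for byte strings
    if kind == "f":
        return float(field)
    raise ValueError("unsupported kind: %r" % (kind,))


# Write a small UTF-8 encoded sample file to a temporary location.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as fh:
    fh.write("caf\u00e9 1.5\nna\u00efve 2.5\n".encode("utf-8"))

rows = read_fields(path, encoding="utf-8")
os.remove(path)

names_text = [to_output(r[0], "U") for r in rows]
names_bytes = [to_output(r[0], "S") for r in rows]
values = [to_output(r[1], "f") for r in rows]
```

Here the file encoding only affects read_fields; whether a column ends up as
text, bytes or floats is decided separately in to_output.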