[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Fri Jan 17 15:36:12 EST 2014

On Fri, Jan 17, 2014 at 3:17 PM, Chris Barker <chris.barker at noaa.gov> wrote:
>> >>> numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'),
>> delimiter=',')
>>
>> Traceback (most recent call last):
>>   File "<pyshell#251>", line 1, in <module>
>>     numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'),
>> delimiter=',')
>>   File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py",
>> line 1828, in recfromtxt
>>     output = genfromtxt(fname, **kwargs)
>>   File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py",
>> line 1351, in genfromtxt
>>     first_values = split_line(first_line)
>>   File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py",
>> line 207, in _delimited_splitter
>>     line = line.split(self.comments)[0]
>> TypeError: Can't convert 'bytes' object to str implicitly
>
>
> That's pretty broken -- if you know the encoding, you should certainly be
> able to get a proper unicode string out of it.
>
>>
>> >>> numpy.recfromtxt(open('Õscar_3.txt',"rb"), delimiter=',')
>> rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'),
>>        (15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')],
>>       dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10')])
>
>
> So the problem here is that recfromtxt is turning every "text" field into
> bytes objects (the 'S' dtype?) -- which is probably not what you want,
> particularly if you specify an encoding. Though I can't figure out at the
> moment why the previous one failed -- where did the bytes object come from
> when the encoding was specified?

Yes, it's a utf-8 file with non-ascii characters.

I don't know what I **should** want.

For now I do want bytes, because that's how I changed statsmodels in
the py3 conversion.

This was just based on the fact that recfromtxt doesn't work with
strings on python 3, so I switched to using bytes following the lead
of numpy.

I'm mainly worried about backwards compatibility, since we have been
using this for 2 or 3 years. It would be easy to change in statsmodels
when gen/recfromtxt is fixed, but I assume there is lots of other code
using a similar interpretation of S/bytes in numpy.
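
For anyone following along, here is a minimal sketch of what working with
the bytes interpretation looks like. The array is built by hand to mimic
the recfromtxt output quoted above (the rows are copied from that output,
not read from the file), and np.char.decode is just one possible way to get
str values back when the encoding is known:

```python
import numpy as np

# Hand-built stand-in for the rec.array shown above: the text column
# is stored as raw bytes in an 'S10' field.
rec = np.array(
    [(1, 2, 3, b"hello"), (20, 2, 3, "Õscar".encode("utf-8"))],
    dtype=[("f0", "<i4"), ("f1", "<i4"), ("f2", "<i4"), ("f3", "S10")],
)

# Decode the bytes column explicitly, since we know the file was utf-8.
names = np.char.decode(rec["f3"], "utf-8")

print(rec["f3"].dtype)   # bytes field, as recfromtxt produced it
print(names.tolist())    # decoded Python str values
```

The decoding step stays outside the reader here, so code that already
relies on the S/bytes fields keeps working unchanged.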

Josef

>
> By the way -- this is apparently a utf-8 file with some non-ascii text in it.
> By my proposal, without an encoding specified, it should default to latin-1:
>
> In that case, you might get unicode string objects that are incorrectly
> decoded. But:
>
> it would not raise an exception
>
> you could recover the proper text with:
>
> the_text.encode('latin-1').decode('utf-8')
>
> On the other hand, if this were an ascii-compatible, non-utf-8 encoded file,
> and we tried to read it as utf-8, it could barf on the non-ascii text
> altogether, and if it didn't, the non-ascii text would be corrupted and
> impossible to recover.
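
A small sketch of both cases described above, using plain Python 3 strings
and no numpy. The sample text 'Õscar' is taken from the file in this
thread; the variable names are only for illustration:

```python
# Case 1: utf-8 bytes mistakenly decoded as latin-1.
raw = "Õscar".encode("utf-8")          # b'\xc3\x95scar'
mis_decoded = raw.decode("latin-1")    # wrong text, but no exception
recovered = mis_decoded.encode("latin-1").decode("utf-8")
assert recovered == "Õscar"            # the original text is recoverable

# Case 2: latin-1 bytes mistakenly decoded as utf-8.
latin_bytes = "Õscar".encode("latin-1")   # b'\xd5scar'
try:
    latin_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True               # strict decoding raises here

# With a lenient error handler the read succeeds, but the original
# byte is replaced by U+FFFD and cannot be reconstructed.
lossy = latin_bytes.decode("utf-8", errors="replace")
assert lossy == "\ufffdscar"
```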
>
> I think the issue is that I'm not really proposing latin-1 -- I'm proposing
> "an ascii-compatible encoding that will do the right thing with ascii bytes,
> and pass through any other bytes untouched" -- latin-1, at least as
> implemented by Python, satisfies that criterion.
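
The pass-through property can be checked directly. This is only a quick
sanity check of Python's latin-1 codec, not anything numpy does today:

```python
# Every possible byte value, 0x00 through 0xff.
all_bytes = bytes(range(256))

# latin-1 maps each byte to the code point with the same number,
# so decoding never fails and encoding gives the same bytes back.
text = all_bytes.decode("latin-1")
round_trip = text.encode("latin-1")
code_points = [ord(ch) for ch in text]

# The ascii range decodes exactly as it would under the ascii codec.
ascii_bytes = bytes(range(128))
same_as_ascii = ascii_bytes.decode("latin-1") == ascii_bytes.decode("ascii")

assert round_trip == all_bytes
assert code_points == list(range(256))
assert same_as_ascii
```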
>
> -Chris
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov