[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Chris Barker chris.barker at noaa.gov
Fri Jan 17 15:17:58 EST 2014


> >>> numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'),
> delimiter=',')

> Traceback (most recent call last):
>   File "<pyshell#251>", line 1, in <module>
>     numpy.recfromtxt(open('Õscar_3.txt',"r", encoding='utf8'),
> delimiter=',')
>   File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py",
> line 1828, in recfromtxt
>     output = genfromtxt(fname, **kwargs)
>   File "C:\Programs\Python33\lib\site-packages\numpy\lib\npyio.py",
> line 1351, in genfromtxt
>     first_values = split_line(first_line)
>   File "C:\Programs\Python33\lib\site-packages\numpy\lib\_iotools.py",
> line 207, in _delimited_splitter
>     line = line.split(self.comments)[0]
> TypeError: Can't convert 'bytes' object to str implicitly
>

That's pretty broken -- if you know the encoding, you should certainly be
able to get a proper unicode string out of it.


> >>> numpy.recfromtxt(open('Õscar_3.txt',"rb"), delimiter=',')
> rec.array([(1, 2, 3, b'hello'), (5, 6, 7, b'\xc3\x95scarscar'),
>        (15, 2, 3, b'hello'), (20, 2, 3, b'\xc3\x95scar')],
>       dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10')])
>

So the problem here is that recfromtxt is turning all "text" fields into
bytes objects (dtype 'S') -- which is probably not what you want,
particularly if you specify an encoding. Though I can't figure out at the
moment why the previous call failed -- where did the bytes object come from
when the encoding was specified?

By the way -- this is apparently a utf-8 file with some non-ascii text in it.
By my proposal, without an encoding specified, it should default to latin-1.

In that case, you might get unicode string objects that are incorrectly
decoded. But:

 - it would not raise an exception

 - you could recover the proper text with:

     the_text.encode('latin-1').decode('utf-8')
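
A minimal sketch of that recovery, using only the standard library. The
sample string is borrowed from the file contents shown earlier; the variable
names are made up for illustration, and nothing here goes through numpy.

```python
# Sample text taken from the file above; these are the UTF-8 bytes
# as they would sit on disk.
original = "Õscar"
raw = original.encode("utf-8")  # b'\xc3\x95scar'

# Decoding as latin-1 never raises, but the non-ascii character
# comes out wrong (two characters instead of one).
wrong = raw.decode("latin-1")

# Re-encoding as latin-1 restores the original bytes exactly,
# so a proper utf-8 decode is still possible afterwards.
recovered = wrong.encode("latin-1").decode("utf-8")
```

The point is that the wrong decode loses nothing: every byte is still there
in `wrong`, just mapped to the wrong characters.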

On the other hand, if this were an ascii-compatible, non-utf8 encoded file,
and we tried to read it as utf-8, it could barf on the non-ascii text
altogether, and if it didn't, the non-ascii text would be corrupted and
impossible to recover.
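
To illustrate both failure modes, here is a small sketch with a latin-1
encoded sample. Using errors="replace" as the "didn't barf" case is my
assumption about how a lenient reader might behave, not something any
numpy function is known to do.

```python
# The same sample text, but stored as latin-1 on disk.
raw = "Õscar".encode("latin-1")  # b'\xd5scar'

# A strict utf-8 decode fails: 0xd5 starts a two-byte sequence
# that is never completed.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A lenient decode succeeds, but replaces the bad byte with U+FFFD,
# so the original byte value is gone for good.
lossy = raw.decode("utf-8", errors="replace")
roundtrip = lossy.encode("utf-8")
```

After the lenient decode there is no way to tell which byte the replacement
character stood for, so the text cannot be repaired.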

I think the issue is that I'm not really proposing latin-1 -- I'm
proposing "an ascii-compatible encoding that will do the right thing with
ascii bytes, and pass through any other bytes untouched" -- latin-1, at
least as implemented by Python, satisfies that criterion.
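
That property is easy to check for Python's latin-1 codec. A quick sketch,
with illustrative variable names:

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# latin-1 decodes each byte to the code point with the same number,
# so no byte value can raise and none is altered.
text = all_bytes.decode("latin-1")
same_code_points = all(ord(ch) == b for ch, b in zip(text, all_bytes))

# Encoding back gives the identical byte string.
round_trips = text.encode("latin-1") == all_bytes

# For the ascii range the result matches a plain ascii decode.
ascii_agrees = all_bytes[:128].decode("ascii") == text[:128]
```

All three checks come out True, which is what "ascii-compatible and passes
other bytes through untouched" means in practice.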

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov