[Numpy-discussion] using loadtxt to load a text file in to a numpy array
oscar.j.benjamin at gmail.com
Thu Jan 23 20:41:24 EST 2014
On 24 January 2014 01:09, Chris Barker <chris.barker at noaa.gov> wrote:
> On Thu, Jan 23, 2014 at 4:02 PM, Oscar Benjamin <oscar.j.benjamin at gmail.com>
>> On 23 January 2014 21:51, Chris Barker <chris.barker at noaa.gov> wrote:
>> > However, I would prefer latin-1 -- that way you might get garbage for
>> > the
>> > non-ascii parts, but it wouldn't raise an exception and it round-trips
>> > through encoding/decoding. And you would have a somewhat more useful
>> > subset
>> > -- including the latin-language character and symbols like the degree
>> > symbol, etc.
>> Exceptions and error messages are a good thing! Garbage is not!!! :)
> in principle, I agree with you, but sometime practicality beets purity.
> in py2 there is a lot of implicit encoding/decoding going on, using the
> system encoding. That is ascii on a lot of systems. The result is that there
> is a lot of code out there that folks have ported to use unicode, but missed
> a few corners. If that code is only testes with ascii, it all seems o be
> working but then out in the wild someone puts another character in there and
> presto -- a crash.
Precisely. The Py3 text model uses TypeErrors to warn early against
this kind of thing. No longer do you have code that seems to work
until the wrong character goes in. You get the error straight away
when you try to mix bytes and text. You still have the option to
silence those errors: it just needs to be done explicitly:
>>> s = 'Õscar'
>>> s.encode('ascii', errors='replace')
> Also, there are places where the inability to encode makes silent message --
> for instance if an Exception is raised with a unicode message, it will get
> silently dropped when it comes time to display on the terminal. I spent
> quite a wile banging my head against that one recently when I tried to
> update some code to read unicode files. I would have been MUCH happier with
> a bit of garbage in the mesae than having it drop (or raise an encoding
> error in the middle of the error...)
Yeah, that's just a bug in CPython. I think it's fixed now but either
way you're right: for the particular case of displaying error messages
the interpreter should do whatever it takes to get some kind of error
message out even if it's a bit garbled. I disagree that this should be
the basis for ordinary data processing with numpy though.
> I think this is a bad thing.
> The advantage of latin-1 is that while you might get something that doesn't
> print right, it won't crash, and it won't contaminate the data, so
> comparisons, etc, will still work. kind of like using utf-8 in an old-style
> c char array -- you can still passi t around and copare it, even if the
> bytes dont mean what you think they do.
It round trips okay as long as you don't try to do anything else with
the string. So does the textarray class I proposed in a new thread: If
you just use fromfile and tofile it works fine for any input (except
for trailing nulls) but if you try to decode invalid bytes it will
throw errors. It wouldn't be hard to add configurable error-handling
More information about the NumPy-Discussion