[Numpy-discussion] using loadtxt to load a text file in to a numpy array
chris.barker at noaa.gov
Fri Jan 17 15:02:52 EST 2014
On Fri, Jan 17, 2014 at 1:38 AM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:
> This thread is getting a little out of hand which is my fault for
> initially mixing different topics in one mail,
still a bit mixed ;-) -- but I think the loadtxt issue requires a lot
less discussion, so we're OK there.
There have been a lot of notes here since I last commented, so I'm going
to stick with the loadtxt issues in this note:
- no possibility to specify the encoding of a file in loadtxt
> this is a missing feature, currently it uses the system default which is
> good and should stay that way.
I disagree -- I think using the "system encoding" is a bad idea for a
default -- I certainly am far more likely to get data files from some other
system than my own -- and really unlikely to use the "system encoding" for
any data files I write, either.
And I'm not being english-centered here -- my data files commonly do have
non-ascii code in there, though frankly, they are either a mess or I know
What should be the default?
Why? Despite our desire to be non-english-focused, most of what loadtxt
does is parse files for numbers, maybe with a bit of text. Numbers are
virtually always ascii-compatible (am I wrong about that? -- if so you'd
damn well better know your encoding!). So it should be an ascii-compatible
encoding.
Why not ascii? -- because then it would barf on non-ascii text in the file
-- really bad idea there.
Why not utf-8? -- this is being *nix-centric -- and utf-8 will work fine on
ascii, but corrupt non-ascii, non-utf-8 data (i.e. any other encoding) and
may barf on some of it too (not sure about that).
latin-1 will never barf on any binary data, will successfully parse any
numeric data (plus spaces, commas, etc.), and will preserve the bytes of any
non-ascii content in the file.
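
To make the comparison concrete, here is a small sketch using only Python's built-in codecs (not loadtxt itself). The sample bytes and the try_decode helper are made up for illustration. It shows that strict ascii and utf-8 decoding both raise on a stray 0xE9 byte, while latin-1 decodes it, still lets the numbers parse, and round-trips the original bytes:

```python
# Sample line: two numbers plus a text field containing byte 0xE9,
# which is valid latin-1 but not valid ASCII or UTF-8 in this position.
raw = b"1.5,2.5,caf\xe9\n"


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


ascii_text = try_decode(raw, "ascii")      # rejected: 0xE9 is outside 7-bit ASCII
utf8_text = try_decode(raw, "utf-8")       # rejected: 0xE9 needs continuation bytes
latin1_text = try_decode(raw, "latin-1")   # accepted: every byte value is defined

# The numeric columns parse no matter what the text column holds.
numbers = [float(field) for field in latin1_text.split(",")[:2]]

# latin-1 maps byte N to code point N, so re-encoding restores the bytes.
round_trip = latin1_text.encode("latin-1")
all_bytes = bytes(range(256))
all_bytes_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

Note that the round trip only preserves bytes; if the file was really in some other encoding, the decoded characters will be wrong even though nothing is lost.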
If you can set the encoding it's not a huge deal what the default is, but I
will recommend that everyone always either sets it to a known encoding or
uses latin-1 -- never the system encoding.
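
As a rough sketch of what "always set the encoding" looks like in practice, the snippet below reads a small file with an explicit encoding instead of relying on the platform default. The load_numbers helper, its parameters, and the sample file contents are hypothetical and only stand in for a real loader; it assumes the first two columns are numeric:

```python
import os
import tempfile


def load_numbers(path, encoding="latin-1", delimiter=","):
    """Hypothetical helper: parse the first two columns of each row as floats.

    The encoding is passed explicitly, so the result does not depend on the
    locale of the machine running the code.
    """
    rows = []
    with open(path, encoding=encoding) as f:
        for line in f:
            fields = line.strip().split(delimiter)
            rows.append([float(x) for x in fields[:2]])
    return rows


# Write a small file whose third column contains non-ASCII bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"1.0,2.0,caf\xe9\n3.0,4.0,na\xefve\n")
    path = tmp.name

try:
    data = load_numbers(path)
finally:
    os.remove(path)
```

Because the encoding is fixed in the call, the same file gives the same result on a machine whose default is utf-8 and on one whose default is something else.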
One more point: on my system right now:
In : sys.getdefaultencoding()
please don't make loadtxt start barfing on files I've been reading just
fine for years....
It is only missing an option to tell it to treat it differently.
> There should be little debate about changing the default, especially not
> using latin1. The system default exists for a good reason.
Maybe, maybe not, but I submit that whatever that "good reason" is, it does
not apply here! This is kind of like datetime64 using the locale timezone
-- makes it useless!
> Note on linux it is UTF-8 which is a good choice. I'm not familiar with
> windows but all programs should at least have the option to use UTF-8 as
> output too.
should, yes, so, maybe, but:
a) not all text data files are written recently or by recently updated
b) This is kind of like saying we should have loadtxt default to utf-8,
which wouldn't be the worst idea -- better than system default, but still
not as good as latin-1
This is a simple question: Should the exact same file read fine with the
exact same code on one machine, but not another? I don't think so.
This has nothing to do with indexing or any kind of processing of the numpy
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov