[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Chris Barker chris.barker at noaa.gov
Fri Jan 17 15:02:52 EST 2014


On Fri, Jan 17, 2014 at 1:38 AM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:

>
> This thread is getting a little out of hand which is my fault for
> initially mixing different topics in one mail,
>

still a bit mixed ;-)  --  but I think the loadtxt issue requires a lot
less discussion, so we're OK there.

There have been a lot of notes here since I last commented, so I'm going
to stick with the loadtxt issues in this note:

> - no possibility to specify the encoding of a file in loadtxt
> this is a missing feature, currently it uses the system default which is
> good and should stay that way.
>

I disagree -- I think using the "system encoding" is a bad idea for a
default -- I certainly am far more likely to get data files from some other
system than my own -- and really unlikely to use the "system encoding" for
any data files I write, either.

And I'm not being English-centered here -- my data files commonly do have
non-ascii code in there, though frankly, they are either a mess or I know
the encoding.

What should be the default?

latin-1

Why? Despite our desire to be non-English-focused, most of what loadtxt
does is parse files for numbers, maybe with a bit of text. Numbers are
virtually always ascii-compatible (am I wrong about that? -- if so  you'd
damn well better know your encoding!). So it should be an ascii-compatible
encoding.

Why not ascii? -- because then it would barf on non-ascii text in the file
-- really bad idea there.

Why not utf-8? -- this is being *nix-centric -- and utf-8 will work fine on
ascii, but corrupt non-ascii, non-utf-8 data (i.e. any other encoding) and
may barf on some of it too (not sure about that).
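
A minimal sketch of the two failure modes described above, using only the Python standard library. The sample bytes are an assumption chosen for illustration (a latin-1 encoded line with a degree sign next to a number), and `try_decode` is a hypothetical helper, not part of any library:

```python
# Hypothetical sample line: "12.5 °C" encoded as latin-1 (0xB0 is the degree sign).
raw = "12.5 \u00b0C".encode("latin-1")

def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

ascii_result = try_decode(raw, "ascii")  # 0xB0 is outside the ascii range
utf8_result = try_decode(raw, "utf-8")   # a lone 0xB0 is not valid utf-8

# With errors="replace" utf-8 no longer raises, but the original byte is lost.
utf8_lossy = raw.decode("utf-8", errors="replace")
```

On this particular sample both strict decodes are rejected, and the lenient utf-8 decode substitutes a replacement character for the original byte.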

latin-1 will never barf on any binary data,  will successfully parse any
numeric data (plus spaces, commas, etc.), and will preserve the bytes of any
non-ascii content in the file.
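
The three properties claimed for latin-1 can be checked with a short sketch, assuming Python 3 and only the standard library. The data line is a made-up example: two numbers followed by a label containing one non-ascii byte.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# latin-1 maps each byte to the code point with the same number, so decoding
# cannot fail and re-encoding returns the original bytes unchanged.
text = all_bytes.decode("latin-1")
round_trip = text.encode("latin-1")

# Hypothetical data line: numbers plus a non-ascii label in an unknown encoding.
line = b"3.14, 2.71, caf\xe9"
fields = [f.strip() for f in line.decode("latin-1").split(",")]
numbers = [float(f) for f in fields[:2]]
label_bytes = fields[2].encode("latin-1")  # original bytes of the label
```

The label may display as the wrong characters if the file was not really latin-1, but its bytes survive and can be re-decoded later once the real encoding is known.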

If you can set the encoding it's not a huge deal what the default is, but I
will recommend that everyone always either sets it to a known encoding or
uses latin-1 -- never the system encoding.
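
As an illustration of that recommendation, here is a sketch of a reader that always takes the encoding explicitly. It assumes Python 3; `read_numeric_columns` is a hypothetical stand-in written for this note, not loadtxt and not a NumPy API, and the sample file contents are made up.

```python
import os
import tempfile

def read_numeric_columns(path, encoding="latin-1", delimiter=","):
    """Parse a delimited text file of numbers into a list of float rows.

    Hypothetical helper, not part of NumPy. The encoding is always passed
    explicitly so the result does not depend on the machine's locale.
    """
    rows = []
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            line = line.split("#", 1)[0].strip()  # drop comments and blanks
            if line:
                rows.append([float(field) for field in line.split(delimiter)])
    return rows

# Write a small sample file whose comment contains a non-ascii byte.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"# temp in \xb0C\n1.0, 2.0\n3.5, 4.5\n")
    sample_path = tmp.name

rows = read_numeric_columns(sample_path)
os.remove(sample_path)
```

Because the encoding is an argument with a fixed default rather than something looked up from the environment, the same call gives the same rows on any machine.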

One more point: on my system right now:

In [15]: sys.getdefaultencoding()
Out[15]: 'ascii'

please don't make loadtxt start barfing on files I've been reading just
fine for years....
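
For anyone who wants to check their own machine, a small sketch using standard-library calls. Note that these are two different settings: the value shown above is Python's internal default for implicit string conversions, while the locale's preferred encoding is what text-mode `open()` has typically fallen back to on Python 3 when no encoding is given.

```python
import locale
import sys

# Encoding Python uses for implicit str/bytes conversions
# ('ascii' on Python 2 as in the session above, 'utf-8' on Python 3).
default_encoding = sys.getdefaultencoding()

# Encoding derived from the locale settings; it can differ from one
# machine to the next.
locale_encoding = locale.getpreferredencoding(False)

print("sys.getdefaultencoding():", default_encoding)
print("locale.getpreferredencoding(False):", locale_encoding)
```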

> It is only missing an option to tell it to treat it differently.
> There should be little debate about changing the default, especially not
> using latin1. The system default exists for a good reason.
>

Maybe, maybe not, but I submit that whatever that "good reason" is, it does
not apply here! This is kind of like datetime64 using the locale timezone
-- makes it useless!


> Note on linux it is UTF-8 which is a good choice. I'm not familiar with
> windows but all programs should at least have the option to use UTF-8 as
> output too.
>

should, yes, so, maybe, but:

a) not all text data files are written recently or by recently updated
software.

b) This is kind of like saying we should have loadtxt default to utf-8,
which wouldn't be the worst idea -- better than system default, but still
not as good as latin-1

This is a simple question: Should the exact same file read fine with the
exact same code on one machine, but not another? I don't think so.

> This has nothing to do with indexing or any kind of processing of the numpy
> arrays.
>

agreed.

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov