[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Fri Jan 17 16:20:39 EST 2014

On Fri, Jan 17, 2014 at 12:36 PM, <josef.pktd at gmail.com> wrote:

> > ('S' ?) -- which is probably not what you want, particularly if you
> > specify an encoding. Though I can't figure out at the moment why the
> > previous one failed -- where did the bytes object come from when the
> > encoding was specified?
>
> Yes, it's a utf-8 file with nonascii.
>
> I don't know what I **should** want.
>

well, you **should** want:

The numbers parsed out for you (otherwise, why use recfromtxt?), and the
text as properly decoded unicode strings.

Python does very well with unicode -- and you are MUCH happier if you do
the encoding/decoding as close to I/O as possible. recfromtxt is, in a way,
decoding already, converting the ascii representation of numbers to an
internal binary representation -- why not handle the text at the same time?
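
To make the "decode close to I/O" point concrete, here is a minimal sketch using only the standard library. It is not how recfromtxt works internally; the sample data, the comma delimiter, and the helper name parse_rows are all made up for illustration, and the encoding is assumed to be known (utf-8).

```python
import io

# Hypothetical file contents: a text column with non-ASCII characters
# and a numeric column, stored as utf-8 bytes.
raw = "Zürich,1.5\nSão Paulo,2.25\n".encode("utf-8")


def parse_rows(stream, encoding="utf-8"):
    """Decode once at the I/O boundary, then parse everything from str."""
    text = io.TextIOWrapper(stream, encoding=encoding)
    rows = []
    for line in text:
        name, value = line.rstrip("\n").split(",")
        # The number becomes a float, the text stays a decoded str.
        rows.append((name, float(value)))
    return rows


rows = parse_rows(io.BytesIO(raw))
print(rows)
```

Everything downstream of parse_rows sees only str and float, so no later code has to know what the file encoding was.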

There certainly are use cases for keeping the text as encoded bytes, but
I'd say those fall into the categories of:

1) Special case
2) You should know what you are doing.

So having recfromtxt auto-determine that for you makes little sense.

Note that if you don't know the file encoding, this is tricky. My thoughts:

1) don't use the system default encoding!!! (see my other note on that!)

2) Either:

   a) open as a binary file and use bytes for anything that doesn't parse
      as text -- this means that the user will need to do the conversion
      to text themselves

   b) decode as latin-1: this would work well for ascii and _some_
      non-ascii text, and would be recoverable for ALL text.

I prefer (b). The point here is that if the user gets bytes, then they
will either have to assume ascii or hand-decode it, and if they just want
to assume ascii, they have a bytes object with limited text functionality,
so will probably need to decode it anyway (unless they are just passing it
through).

If the user gets unicode objects that may not be properly decoded, they
can either assume it was ascii, and if they only do ascii-compatible things
with it, it will work, or they can encode/decode it and get the proper
stuff back, but only if they know the encoding, and if that's the case, why
did they not specify that in the first place?
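
A small sketch of why latin-1 is "recoverable for ALL text": the codec maps each of the 256 byte values to exactly one code point, so decoding never fails and re-encoding gives back the original bytes. The sample string is made up, and utf-8 is only assumed here as the file's real encoding.

```python
# Bytes as they would sit in a utf-8 file (sample text is hypothetical).
original = "naïve café".encode("utf-8")

# Option (b): decode as latin-1 without knowing the real encoding.
guessed = original.decode("latin-1")

# Pure-ascii content comes through unchanged.
assert b"plain ascii".decode("latin-1") == "plain ascii"

# Non-ascii content looks wrong, but nothing has been lost...
assert guessed != "naïve café"

# ...because re-encoding as latin-1 returns the exact original bytes,
# which can then be decoded properly once the encoding is known.
recovered = guessed.encode("latin-1").decode("utf-8")
assert recovered == "naïve café"

# The round trip holds for every possible byte value.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

The same round trip is not safe with utf-8 or ascii as the guess, since those decoders raise on bytes that are not valid in the encoding.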

> For now I do want bytes, because that's how I changed statsmodels in
> the py3 conversion.
>
> This was just based on the fact that recfromtxt doesn't work with
> strings on python 3, so I switched to using bytes following the lead
> of numpy.
>

Well, that's really too bad -- it doesn't sound like you wanted bytes, it
sounds like you wanted something that didn't crash -- fair enough. But the
"proper" solution is for recfromtxt to support text....

> I'm mainly worried about backwards compatibility, since we have been
> using this for 2 or 3 years. It would be easy to change in statsmodels
> when gen/recfromtxt is fixed, but I assume there is lots of other code
> using similar interpretation of S/bytes in numpy.
>

well, it does sound like enough folks are using 'S' to mean bytes -- too
bad, but what can we do now about that?

I'd like an 's' for ascii-strings though.
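
For anyone following along, a quick sketch of what 'S' versus 'U' means on Python 3. It assumes numpy is installed and uses a made-up sample string; it only shows the existing dtypes, not any proposed 's' type.

```python
import numpy as np

# 'S' holds raw bytes: the itemsize counts encoded bytes, not characters.
as_bytes = np.array(["café".encode("utf-8")])
# 'U' holds decoded text: the length counts code points.
as_text = np.array(["café"])

print(as_bytes.dtype.kind, as_bytes.dtype.itemsize)  # kind 'S', 5 bytes
print(as_text.dtype.kind, len(as_text[0]))           # kind 'U', 4 characters

# Elements of an 'S' array are bytes objects, so text operations need
# an explicit decode with a known encoding first.
decoded = np.char.decode(as_bytes, "utf-8")
```

So code that relies on 'S' gets bytes back on Python 3, and that is the behaviour that would have to be preserved for backwards compatibility.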

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov