[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Fri Jan 17 17:40:47 EST 2014

On Fri, Jan 17, 2014 at 4:43 PM, <josef.pktd at gmail.com> wrote:

> On Fri, Jan 17, 2014 at 4:20 PM, Chris Barker <chris.barker at noaa.gov>
> wrote:
> > On Fri, Jan 17, 2014 at 12:36 PM, <josef.pktd at gmail.com> wrote:
> >>
> >> > ('S' ?) -- which is probably not what you want particularly if you
> >> > specify an encoding. Though I can't figure out at the moment why the
> >> > previous one failed -- where did the bytes object come from when the
> >> > encoding was specified?
> >>
> >> Yes, it's a utf-8 file with nonascii.
> >>
> >> I don't know what I **should** want.
> >
> >
> > well, you **should** want:
> >
> > The numbers parsed out for you (otherwise, why use recfromtxt?), and the
> > text as properly decoded unicode strings.
> >
> > Python does very well with unicode -- and you are MUCH happier if you do
> > the encoding/decoding as close to I/O as possible. recfromtxt is, in a
> > way, decoding already, converting the ascii representation of numbers to
> > an internal binary representation -- why not handle the text at the same
> > time?
> >
> > There certainly are use cases for keeping the text as encoded bytes, but
> > I'd say those fall into the categories of:
> >
> > 1) Special case
> > 2) You should know what you are doing.
> >
> > So having recfromtxt auto-determine that for you makes little sense.
> >
> > Note that if you don't know the file encoding, this is tricky. My
> > thoughts:
> >
> > 1) don't use the system default encoding!!! (see my other note on that!)
> >
> > 2) Either:
> >     a) open as a binary file and use bytes for anything that doesn't
> >        parse as text -- this means that the user will need to do the
> >        conversion to text themselves
> >
> >     b) decode as latin-1: this would work well for ascii and _some_
> >        non-ascii text, and would be recoverable for ALL text.
> >
> > I prefer (b). The point here is that if the user gets bytes, then they
> > will either have to assume ascii, or need to hand-decode it, and if they
> > just want to assume ascii, they have a bytes object with limited text
> > functionality, so will probably need to decode it anyway (unless they
> > are just passing it through).
> >
> > If the user gets unicode objects that may not be properly decoded, they
> > can either assume it was ascii, and if they only do ascii-compatible
> > things with it, it will work, or they can encode/decode it and get the
> > proper stuff back, but only if they know the encoding, and if that's the
> > case, why did they not specify that in the first place?
> >
> >>
> >> For now I do want bytes, because that's how I changed statsmodels in
> >> the py3 conversion.
> >>
> >> This was just based on the fact that recfromtxt doesn't work with
> >> strings on python 3, so I switched to using bytes following the lead
> >> of numpy.
> >
> >
> > Well, that's really too bad -- it doesn't sound like you wanted bytes, it
> > sounds like you wanted something that didn't crash -- fair enough. But
> > the "proper" solution is for recfromtxt to support text....
>
> But solution 2a) is also fine for most of the code.
> Often it doesn't really matter:
>
> >>> dta_4
> array([(1, 2, 3, b'hello', 'hello'),
>        (5, 6, 7, b'\xc3\x95scarscar', 'Õscarscar'),
>        (15, 2, 3, b'hello', 'hello'),
>        (20, 2, 3, b'\xc3\x95scar', 'Õscar')],
>       dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'),
>              ('f3', 'S10'), ('f4', '<U9')])
>
> >>> (dta_4['f3'][:, None] == np.unique(dta_4['f3'])).astype(int)
> array([[1, 0, 0],
>        [0, 0, 1],
>        [1, 0, 0],
>        [0, 1, 0]])
> >>> (dta_4['f4'][:, None] == np.unique(dta_4['f4'])).astype(int)
> array([[1, 0, 0],
>        [0, 0, 1],
>        [1, 0, 0],
>        [0, 1, 0]])
>
> It is similar when doing a for loop comparing to the uniques.
> Bytes are fine and nobody has to tell me what encoding they are using.
>

From my perspective bytes are not fine, at least if you want to use normal
string literals in Python 3:

In [64]: dat
Out[64]:
array([(1, 2, 3, b'hello', 'hello'),
       (5, 6, 7, b'\xc3\x95scarscar', 'Õscarscar'),
       (15, 2, 3, b'hello', 'hello'),
       (20, 2, 3, b'\xc3\x95scar', 'Õscar')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'),
             ('f3', 'S10'), ('f4', '<U9')])

In [65]: dat['f3'] == 'hello'  # this is how I would find "hello" in my array, FAIL
Out[65]: False

In [66]: dat['f3'] == b'hello'  # OK, I have to use a bytestring literal
Out[66]: array([ True, False,  True, False], dtype=bool)

In [67]: dat['f4'] == 'hello'  # Works as expected for unicode field
Out[67]: array([ True, False,  True, False], dtype=bool)
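
A minimal sketch of one workaround for the bytes field, not taken from the session above: decode the whole column once and then compare against ordinary string literals. The array here is rebuilt by hand to mirror `dat['f3']`, and the 'utf-8' encoding is an assumption about the file (it matches the data shown, but in general you have to know it).

```python
import numpy as np

# Hand-built stand-in for the 'f3' (bytes, dtype S10) column shown above.
f3 = np.array([b'hello', b'\xc3\x95scarscar', b'hello', b'\xc3\x95scar'],
              dtype='S10')

# Decode every element in one call; 'utf-8' is an assumption about the file.
f3_text = np.char.decode(f3, 'utf-8')

# The decoded column is a unicode array, so a plain str literal now works.
mask = f3_text == 'hello'
print(mask)
```

This only moves the decoding step to the user, which is the cost of option 2a) discussed in the quoted text.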

And then when you want to look at your data it continues to be difficult:

In [80]: 'The 3rd element of f3 is "%s"' % dat['f3'][2]
Out[80]: 'The 3rd element of f3 is "b\'hello\'"'

In [81]: 'The 3rd element of f3 is "%s"' % dat['f3'][2].decode('ascii')  # SIGH
Out[81]: 'The 3rd element of f3 is "hello"'

+1 for something like the latin-1 or ascii unicode dtype that can make it a
lot easier for things to just work.
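
To illustrate why a latin-1 decode is "recoverable for ALL text", here is a small sketch using plain Python bytes and str (no numpy dtype involved; the sample value is the UTF-8 encoding of 'Õscar' from the array above). latin-1 maps each of the 256 byte values to one code point, so decoding never fails and re-encoding gives back the original bytes exactly.

```python
raw = b'\xc3\x95scar'  # UTF-8 bytes for 'Õscar'

# Decoding as latin-1 never raises: every byte maps to a code point.
# The result is mis-decoded for non-ascii input, but nothing is lost.
as_latin1 = raw.decode('latin-1')

# Round trip: re-encode as latin-1 to get the original bytes back,
# then decode with the real encoding once it is known.
recovered = as_latin1.encode('latin-1').decode('utf-8')

# Pure ascii text comes out right on the first decode.
plain = b'hello'.decode('latin-1')
```

For ascii-only fields the latin-1 result is already correct, which covers the `== 'hello'` comparison above without any extra step.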

- Tom

p.s. I usually use format(), not %.  Alas I ran into what I think is an old
bug:

In [82]: 'The 3rd element of f3 is "{}"'.format(dat['f3'][3])
ERROR: RuntimeError: maximum recursion depth exceeded while calling a
Python object [IPython.core.interactiveshell]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-82-a7f1f486497d> in <module>()
----> 1 'The 3rd element of f3 is "{}"'.format(dat['f3'][3])

RuntimeError: maximum recursion depth exceeded while calling a Python object

>
> It doesn't work so well for pretty printing results, so using latin-1
> there, as you describe above, might be a good solution if users don't
> decode to text/string.
>
> Josef
>
> >
> >> I'm mainly worried about backwards compatibility, since we have been
> >> using this for 2 or 3 years. It would be easy to change in statsmodels
> >> when gen/recfromtxt is fixed, but I assume there is lots of other code
> >> using similar interpretation of S/bytes in numpy.
> >
> >
> > well, it does sound like enough folks are using 'S' to mean bytes -- too
> > bad, but what can we do now about that?
> >
> > I'd like a 's' for ascii-strings though.
> >
> > -Chris
> >
> > --
> >
> > Christopher Barker, Ph.D.
> > Oceanographer
> >
> > Emergency Response Division
> > NOAA/NOS/OR&R            (206) 526-6959   voice
> > 7600 Sand Point Way NE   (206) 526-6329   fax
> > Seattle, WA  98115       (206) 526-6317   main reception
> >
> > Chris.Barker at noaa.gov
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at scipy.org
> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >