[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Fri Jan 17 16:43:58 EST 2014

On Fri, Jan 17, 2014 at 4:20 PM, Chris Barker <chris.barker at noaa.gov> wrote:
> On Fri, Jan 17, 2014 at 12:36 PM, <josef.pktd at gmail.com> wrote:
>>
>> > ('S' ?) -- which is probably not what you want particularly if you
>> > specify
>> > an encoding. Though I can't figure out at the moment why the previous
>> > one
>> > failed -- where did the bytes object come from when the encoding was
>> > specified?
>>
>> Yes, it's a utf-8 file with nonascii.
>>
>> I don't know what I **should** want.
>
>
> well, you **should** want:
>
> The numbers parsed out for you (otherwise, why use recfromtxt?), and the
> text as properly decoded unicode strings.
>
> Python does very well with unicode -- and you are MUCH happier if you do the
> encoding/decoding as close to I/O as possible. recfromtxt is, in a way,
> decoding already, converting ascii representation of numbers to an internal
> binary representation -- why not handle the text at the same time?
>
> There certainly are use cases for keeping the text as encoded bytes, but I'd
> say those fall into the categories of:
>
> 1) Special case
> 2) You should know what you are doing.
>
> So having recfromtxt auto-determine that for you makes little sense.
>
> Note that if you don't know the file encoding, this is tricky. My thoughts:
>
> 1) don't use the system default encoding!!! (see my other note on that!)
>
> 2) Either:
>     a) open as a binary file and use bytes for anything that doesn't parse
> as text -- this means that the user will need to do the conversion to text
> themselves
>
>     b) decode as latin-1: this would work well for ascii and _some_ non-ascii
> text, and would be recoverable for ALL text.
>
> I prefer (b). The point here is that if the user gets bytes, then they will
> either have to assume ascii, or need to hand-decode it, and if they just
> want to assume ascii, they have a bytes object with limited text functionality
> so will probably need to decode it anyway (unless they are just passing it
> through).
>
> If the user gets unicode objects that may not be properly decoded, they can
> either assume it was ascii, and if they only do ascii-compatible things with
> it, it will work, or they can encode/decode it and get the proper stuff
> back, but only if they know the encoding, and if that's the case, why did
> they not specify that in the first place?
>
>>
>> For now I do want bytes, because that's how I changed statsmodels in
>> the py3 conversion.
>>
>> This was just based on the fact that recfromtxt doesn't work with
>> strings on python 3, so I switched to using bytes following the lead
>> of numpy.
>
>
> Well, that's really too bad -- it doesn't sound like you wanted bytes, it
> sounds like you wanted something that didn't crash -- fair enough. But the
> "proper" solution is for recfromtxt to support text....

But solution 2a) is also fine for most of the code.
Often it doesn't really matter:

>>> dta_4
array([(1, 2, 3, b'hello', 'hello'),
       (5, 6, 7, b'\xc3\x95scarscar', 'Õscarscar'),
       (15, 2, 3, b'hello', 'hello'), (20, 2, 3, b'\xc3\x95scar', 'Õscar')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10'),
             ('f4', '<U9')])

>>> (dta_4['f3'][:, None] == np.unique(dta_4['f3'])).astype(int)
array([[1, 0, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]])
>>> (dta_4['f4'][:, None] == np.unique(dta_4['f4'])).astype(int)
array([[1, 0, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]])

It's similar when doing a for loop that compares to the uniques.
Bytes are fine, and nobody has to tell me what encoding they are using.
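
A minimal sketch of the loop version, assuming two small arrays that stand in
for the 'f3' (bytes) and 'f4' (unicode) columns of dta_4. The names as_bytes,
as_text and dummies are made up for illustration and are not statsmodels or
numpy API:

```python
import numpy as np

# Hypothetical stand-ins for the 'f3' (bytes) and 'f4' (unicode) columns.
names = ['hello', 'Õscarscar', 'hello', 'Õscar']
as_bytes = np.array([n.encode('utf-8') for n in names])  # bytes ('S') dtype
as_text = np.array(names)                                 # unicode ('U') dtype


def dummies(column):
    """Build one 0/1 indicator column per unique value with a plain loop."""
    uniques = np.unique(column)
    out = np.zeros((len(column), len(uniques)), dtype=int)
    for j, value in enumerate(uniques):
        # Elementwise equality works the same way for bytes and for unicode.
        out[:, j] = column == value
    return out


dummies_bytes = dummies(as_bytes)
dummies_text = dummies(as_text)
print(dummies_bytes)
print((dummies_bytes == dummies_text).all())
```

For this particular data the two results agree, because the utf-8 bytes and
the decoded strings happen to sort in the same order. That is not guaranteed
for every encoding, so the column order of the dummies may differ in general.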

It doesn't work so well for pretty printing results, so using
latin-1 there, as you describe above, might be a good solution if users don't
decode to text/string themselves.
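
A small sketch of why latin-1 is a safe fallback for display, using only plain
Python. The sample value is made up and the variable names are illustrative.
Decoding arbitrary bytes as latin-1 never raises, and the original bytes can
always be recovered by encoding back:

```python
# Bytes as they might come out of a utf-8 file whose encoding was not given.
raw = 'Õscar'.encode('utf-8')

# latin-1 maps every byte value to a code point, so this cannot fail.
# The result may be mojibake for non-ascii input, but nothing is lost.
shown = raw.decode('latin-1')

# Round trip: encoding back to latin-1 returns exactly the original bytes.
roundtrip = shown.encode('latin-1')

# Once the real encoding is known, the proper text can be recovered.
recovered = roundtrip.decode('utf-8')

print(roundtrip == raw)
print(recovered)
```

Pure ascii values come out unchanged, so the fallback only looks wrong for
the non-ascii entries.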

Josef

>
>> I'm mainly worried about backwards compatibility, since we have been
>> using this for 2 or 3 years. It would be easy to change in statsmodels
>> when gen/recfromtxt is fixed, but I assume there is lots of other code
>> using similar interpretation of S/bytes in numpy.
>
>
> well, it does sound like enough folks are using 'S' to mean bytes -- too
> bad, but what can we do now about that?
>
> I'd like an 's' for ascii-strings though.
>
> -Chris
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov