[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Chris Barker chris.barker at noaa.gov
Fri Jan 17 16:55:56 EST 2014


On Fri, Jan 17, 2014 at 1:43 PM, <josef.pktd at gmail.com> wrote:

> > 2) Either:
> >     a) open as a binary file and use bytes for anything that doesn't
> >        parse as text -- this means that the user will need to do the
> >        conversion to text themselves
> >
> >     b) decode as latin-1: this would work well for ascii and _some_
> >        non-ascii text, and would be recoverable for ALL text.
>


> But also solution 2a) is fine for most of the code
> Often it doesn't really matter
>

indeed -- I did list it as an option ;-)


> >>> dta_4
> array([(1, 2, 3, b'hello', 'hello'),
>        (5, 6, 7, b'\xc3\x95scarscar', 'Õscarscar'),
>        (15, 2, 3, b'hello', 'hello'),
>        (20, 2, 3, b'\xc3\x95scar', 'Õscar')],
>       dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'),
>              ('f3', 'S10'), ('f4', '<U9')])
>
> >>> (dta_4['f3'][:, None] == np.unique(dta_4['f3'])).astype(int)
> array([[1, 0, 0],
>        [0, 0, 1],
>        [1, 0, 0],
>        [0, 1, 0]])
> >>> (dta_4['f4'][:, None] == np.unique(dta_4['f4'])).astype(int)
> array([[1, 0, 0],
>        [0, 0, 1],
>        [1, 0, 0],
>        [0, 1, 0]])
>
> It is similar when doing a for loop comparing to the uniques.
> Bytes are fine, and nobody has to tell me what encoding they are using.
>

and this same operation would work fine if that text was in (possibly
improperly decoded) unicode objects.
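
A minimal sketch of that point, using plain Python lists rather than a
structured array (the names `raw`, `decoded` and `indicator` are made up
for illustration, and the byte strings are copied from the quoted example).
It builds the same kind of indicator matrix from the raw bytes and from the
latin-1 decoded strings and checks that the two agree:

```python
# Pure-Python sketch; `raw`, `decoded` and `indicator` are illustrative names.
raw = [b"hello", b"\xc3\x95scarscar", b"hello", b"\xc3\x95scar"]

# latin-1 maps every byte value 0-255 to exactly one code point,
# so this decode cannot raise, whatever the real encoding was.
decoded = [b.decode("latin-1") for b in raw]


def indicator(values):
    """Return one row per value and one column per sorted unique value."""
    uniques = sorted(set(values))
    return [[int(v == u) for u in uniques] for v in values]


from_bytes = indicator(raw)
from_text = indicator(decoded)

# The decoded strings are not the intended text (the file was really UTF-8),
# but equality and ordering are preserved, so the matrices match.
print(from_bytes == from_text)
```

The decoded strings here are wrong as text, yet comparisons, uniques and
sorting behave exactly as they do on the bytes.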


> It doesn't work so well for pretty printing results, so using latin-1
> there, as you describe above, might be a good solution if users don't
> decode to text/string
>

exactly -- if you really need to work with the text, you need to know the
encoding. Period. End of Story.

If you don't know the encoding then there is still some stuff you can do
with it, so you want something that:

a) won't barf on any input

b) will preserve the bytes if you need to pass them along, or compare them,
or...

Either bytes or latin-1 decoded strings will work for that. Bytes are
better in that they make it more explicit that you may not have valid text
here. Unicode strings are better in that you can do stringy things with
them. Either way, you'll need to encode or decode to get full functionality.
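
As a rough illustration of requirements a) and b), here is a small sketch
using only the standard codecs (variable names are hypothetical, and the
assumption that the file was really UTF-8 is just for the example). Decoding
as latin-1 accepts any byte sequence, re-encoding gives back the original
bytes, and the real text can be recovered once the true encoding is known:

```python
# Hypothetical input: UTF-8 bytes for the name "Õscar" (U+00D5 + "scar").
raw = "\u00d5scar".encode("utf-8")      # b'\xc3\x95scar'

# a) latin-1 decoding never fails, although the result is mojibake here.
as_text = raw.decode("latin-1")

# b) The original bytes are fully preserved and can be passed along.
round_tripped = as_text.encode("latin-1")

# Once the real encoding is known, the proper text can be recovered.
recovered = round_tripped.decode("utf-8")

# The round trip holds for every possible byte value, not just this input.
every_byte = bytes(range(256))
all_bytes_ok = every_byte.decode("latin-1").encode("latin-1") == every_byte

print(round_tripped == raw, recovered == "\u00d5scar", all_bytes_ok)
```

The mis-decoded `as_text` is fine for comparing or passing along, but it
should not be shown to a user until it has been re-decoded properly.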

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov