[Numpy-discussion] using loadtxt to load a text file in to a numpy array

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Jan 23 18:56:36 EST 2014


On Thu, Jan 23, 2014 at 4:51 PM, Chris Barker <chris.barker at noaa.gov> wrote:
> On Thu, Jan 23, 2014 at 12:10 PM, <josef.pktd at gmail.com> wrote:
>>
>> > Exactly -- but what should those conversion/casting rules be? We can't
>> > decide that unless we decide if 'S' is for text or for arbitrary bytes
>> > -- it can't be both. I say text, that's what it's mostly trying to do
>> > already. But if it's bytes, fine, then some things still need cleaning
>> > up, and we could really use a one-byte-text type. And if it's text,
>> > then we may need a bytes dtype.
>>
>> (remember I'm just a balcony muppet)
>
>
> me too ;-)
>
>
>>
>> As far as I understand all codecs have the same ascii part.
>
>
> nope -- certainly not multi-byte codecs. And one of the key points of utf-8
> is that the ascii part is compatible -- none of the other full-unicode
> encodings are.
>
> many of the one-byte-per-char ones do share the ascii part, but not all, or
> not completely.
>
>> So I would
>> cast on ascii and raise on anything else.
>
>
> still a fine option -- clearly defined and quite useful for scientific text.
> However, I would prefer latin-1 -- that way you might get garbage for the
> non-ascii parts, but it wouldn't raise an exception and it round-trips
> through encoding/decoding. And you would have a somewhat more useful subset
> -- including the latin-language characters and symbols like the degree
> symbol, etc.
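
For reference, the codec behaviour described above can be checked in plain Python 3. This is only a small sketch; the sample byte strings are made up for illustration and are not from the thread.

```python
# Every possible byte value decodes under latin-1 and encodes back unchanged.
raw = bytes(range(256))
latin1_roundtrip = raw.decode("latin-1").encode("latin-1") == raw

# The strict ascii codec raises on any byte >= 0x80.
try:
    b"25 \xb0C".decode("ascii")
    ascii_raised = False
except UnicodeDecodeError:
    ascii_raised = True

# latin-1 maps byte 0xb0 to the degree sign.
degree_text = b"25 \xb0C".decode("latin-1")

# utf-8 keeps ASCII text byte-for-byte; utf-16 does not.
utf8_bytes = "abc".encode("utf-8")
utf16_bytes = "abc".encode("utf-16-le")

print(latin1_roundtrip, ascii_raised, degree_text, utf8_bytes, utf16_bytes)
```

So "cast on ascii and raise" and "decode as latin-1" differ only in what happens to bytes 0x80-0xff: the first refuses them, the second always succeeds but may assign them the wrong characters.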

I'm not sure anymore. After all these threads, I think bytes should be
bytes and strings should be strings:

>>> x = np.array(['hugo'], 'S')
Traceback (most recent call last):
  File "<pyshell#61>", line 1, in <module>
    x = np.array(['hugo'], 'S')
ValueError: could not convert string to bytes: 'hugo'

>>> x = np.array([b'hugo'], 'S')
>>>

but with support for textarrays as Oscar showed, to make it easy to
convert between 'S' and 'S:encoding' or to use either view on the
memory.
I like the idea of an `encoding_view` on some 'S' bytes, and once we
have a view like that there is no reason to pretend 'S' bytes are
text.
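
A rough sketch of the explicit bytes/text split, using what NumPy already has. Note that `encoding_view` and 'S:encoding' are only ideas from this thread; the code below uses the existing `np.char.decode` and `np.char.encode`, which copy the data rather than giving a view on the same memory. The sample values are made up.

```python
import numpy as np

# 'S' holds raw bytes; here they happen to be latin-1 encoded text.
raw = np.array([b"hugo", b"caf\xe9"], dtype="S4")

# Explicit decode: bytes -> str ('U' dtype). This makes a copy, not a view.
text = np.char.decode(raw, "latin-1")

# Explicit encode: str -> bytes, recovering the original byte values.
back = np.char.encode(text, "latin-1")

print(text.dtype.kind, back.dtype, bool((back == raw).all()))
```

The encoding is named at every conversion, so nothing has to guess whether the 'S' data is text.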


>
>>
>> or follow whatever the convention of numpy is:
>>
>> >>> s = -256
>> >>> np.array((s,), dtype=np.uint8)[0] == s
>> False
>> >>> s = -1
>> >>> np.array((s,), dtype=np.uint8)[0] == s
>> False
>
>
> I think text is distinct enough from numbers that we don't need to do that
> same thing -- and this is the result of well-defined casting rules built into
> the compiler (and hardware?) for the numeric types. I don't think we have
> either the standard or compiler support for text conversions like that.
>
> -CHB
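
To make the numeric convention above concrete, here is a small sketch of the wrap-around. It goes through `astype` on an int64 array instead of passing Python ints straight to `np.array(..., dtype=np.uint8)`, because I believe recent NumPy versions reject out-of-range Python ints in that form; the sample values are illustrative.

```python
import numpy as np

# Casting int64 to uint8 keeps only the low 8 bits (value modulo 256).
values = np.array([-256, -1, 255, 256], dtype=np.int64)
wrapped = values.astype(np.uint8)

print(wrapped.tolist())
```

No exception is raised and no information about the overflow is kept, which is why the comparisons with the original `s` come out False.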
>
> PS: this is interesting, on py2:
>
>
> In [176]: a = np.array((2222,), dtype='S')
>
> In [177]: a
> Out[177]:
> array(['2'],
>       dtype='|S1')
>
> It converts it to a string, but only grabs the first character? (Is it
> determining the size before converting to a string?)

I recently fixed a bug in statsmodels based on this. I don't know why
the code worked before; I assume it used string integers instead of
integers at some point when it was written.

>
> and this:
>
> In [182]: a = np.array(2222, dtype='S')
>
> In [183]: a
> Out[183]:
> array('2222',
>       dtype='|S24')
>
> 24 ? where did that come from?

No idea.
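
One way to sidestep the length inference in both cases is to do the integer-to-bytes conversion explicitly before building the array. A minimal sketch, with made-up values:

```python
import numpy as np

values = [2222, 7]

# Convert each integer to ASCII bytes first, so NumPy only has to size
# the 'S' dtype from bytes objects whose lengths are already known.
as_bytes = np.array([str(v).encode("ascii") for v in values])

print(as_bytes.dtype, as_bytes.tolist())
```

The itemsize then comes from the longest converted value, not from whatever the integer-to-string cast decides.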

Unless I missed something when I wasn't paying attention, there was
never any discussion on the mailing list about bytes versus strings
in python 3 in numpy (I don't follow numpy's "issues").
Nor do I remember (m)any public complaints about the behavior of
the 'S' type in strange cases.

Maybe I didn't pay attention because I didn't care, until we ran into
the python 3 problems. Maybe nobody else did either.


Josef

>>
>>
>> Josef
>>
>> >
>> > Key here is that we don't have the option of not breaking anything,
>> > because there is a lot already broken.
>> >
>> > -Chris
>> >
>> >
>> > --
>> >
>> > Christopher Barker, Ph.D.
>> > Oceanographer
>> >
>> > Emergency Response Division
>> > NOAA/NOS/OR&R            (206) 526-6959   voice
>> > 7600 Sand Point Way NE   (206) 526-6329   fax
>> > Seattle, WA  98115       (206) 526-6317   main reception
>> >
>> > Chris.Barker at noaa.gov
>> >
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion at scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion


