[Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 12:10 PM, <josef.pktd at gmail.com> wrote:
me too ;-)
> As far as I understand all codecs have the same ascii part.
nope -- certainly not multi-byte codecs. And one of the key points of utf-8
is that the ascii part is compatible -- none of the other full-unicode
encodings are.
many of the one-byte-per-char ones do share the ascii part, but not all, or
not completely.
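
A quick sketch of what I mean, assuming Python 3 and the codecs that ship with the standard library (cp037 is just one example I picked of a one-byte EBCDIC codec that does not share the ascii part):

```python
# Encode the same plain-ASCII text with several codecs and compare the bytes.
text = "abc"
ascii_bytes = text.encode("ascii")

utf8 = text.encode("utf-8")        # identical to ASCII for these characters
latin1 = text.encode("latin-1")    # one byte per char, shares the ASCII range
utf16 = text.encode("utf-16-le")   # two bytes per char, so not ASCII-compatible
utf32 = text.encode("utf-32-le")   # four bytes per char
ebcdic = text.encode("cp037")      # one byte per char, but a different layout

for name, data in [("utf-8", utf8), ("latin-1", latin1), ("utf-16-le", utf16),
                   ("utf-32-le", utf32), ("cp037", ebcdic)]:
    print(name, data, data == ascii_bytes)
```

Only utf-8 and latin-1 come out byte-for-byte equal to the ascii encoding here.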
> cast on ascii and raise on anything else.
still a fine option -- clearly defined and quite useful for scientific
text. However, I would prefer latin-1 -- that way you might get garbage
for the non-ascii parts, but it wouldn't raise an exception and it
round-trips through encoding/decoding. And you would have a somewhat more
useful subset -- including the latin-language characters and symbols like
the degree symbol, etc.
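
To illustrate the round-trip property, here is a minimal sketch, assuming Python 3. The byte string b"25\xb0C" is just a made-up sample containing the latin-1 degree symbol:

```python
# Every possible byte value, 0 through 255.
raw = bytes(range(256))

# latin-1 maps each byte to the code point with the same number,
# so decoding cannot fail and encoding gives back the original bytes.
decoded = raw.decode("latin-1")
round_tripped = decoded.encode("latin-1")
print(round_tripped == raw)

sample = b"25\xb0C"  # hypothetical data with a degree symbol

# Strict ascii decoding raises on the first byte >= 0x80.
try:
    sample.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)

# latin-1 decodes the same bytes without complaint.
degree = sample.decode("latin-1")
print(degree)
```

If the data was really in some other encoding you get garbage for the non-ascii bytes, but nothing is lost, because encoding back to latin-1 recovers the original bytes.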
>
> >>> s = -256
> >>> np.array((s,), dtype=np.uint8)[0] == s
> False
> >>> s = -1
> >>> np.array((s,), dtype=np.uint8)[0] == s
> False
I think text is distinct enough from numbers that we don't need to do
that same thing -- and this is the result of well-defined casting rules built
into the compiler (and hardware?) for the numeric types. I don't think we
have either the standard or compiler support for text conversions like that.
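
For reference, a small sketch of the numeric wrap-around being discussed. I use astype() here rather than passing the Python ints straight to np.array(..., dtype=np.uint8) as in the quoted session, on the assumption that newer numpy releases may reject out-of-range Python ints at construction time:

```python
import numpy as np

# Start from a signed integer array, then cast down to uint8.
values = np.array([-1, -256, 257])
wrapped = values.astype(np.uint8)  # default casting for astype is 'unsafe'

# The cast keeps only the low 8 bits, i.e. the value modulo 256.
print(wrapped)
print(values % 256)
```

The cast is lossy but completely predictable, which is the part that has no obvious analogue for text.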
PS: this is interesting, on py2:
In [176]: a = np.array((2222,), dtype='S')
In [177]: a
Out[177]:
array(['2'],
dtype='|S1')
It converts it to a string, but only grabs the first character? (Is
it determining the size before converting to a string?)
and this:
In [182]: a = np.array(2222, dtype='S')
In [183]: a
Out[183]:
array('2222',
dtype='|S24')
24 ? where did that come from?
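
A possible way to sidestep the surprise, sketched under the assumption that you convert to a string yourself before handing the value to numpy. This does not explain the S1 / S24 results above, it only avoids relying on them. The S8 width is an arbitrary choice for illustration:

```python
import numpy as np

n = 2222

# Convert to text first, so numpy sizes the dtype from the actual string.
a = np.array([str(n)], dtype="S")
print(a.dtype, a[0])

# Or state the width explicitly so there is nothing left to guess.
b = np.array([str(n)], dtype="S8")
print(b.dtype, b[0])
```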
Josef
