[Numpy-discussion] adding more unicode dtypes
chris.barker at noaa.gov
Wed Jan 15 15:07:35 EST 2014
Julian -- beat me to it!
On Wed, Jan 15, 2014 at 10:25 AM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:
> On 15.01.2014 18:57, Charles R Harris wrote:
> > There was a discussion of this long ago and UCS-4 was chosen as the
> > numpy standard. There are just too many complications that arise in
> > supporting both.
supporting both UCS-4 and UCS-2 would be more pain than it's worth.
> In python3 you need extra code to deal with arrays containing strings as
> the S type is interpreted as bytes which is not a string type anymore.
ouch! I was just assuming that it still was -- yes, I really think we need
a one-byte-per-char string type -- probably ascii, but we could do latin-1
and let the buyer beware of the higher-value bytes.
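
To make the "buyer beware" point concrete, here is a minimal sketch,
assuming Python 3 and a reasonably recent NumPy. The sample value is made
up for illustration; it only shows how a byte above 0x7F behaves under the
two candidate encodings.

```python
import numpy as np

# One byte per character: an "S" array holding a byte above 0x7F.
raw = np.array([b"caf\xe9"], dtype="S4")

# latin-1 maps every byte value 0-255 to a code point, so decoding
# always succeeds, whether or not the data really was latin-1.
as_latin1 = np.char.decode(raw, "latin-1")
print(as_latin1[0])

# Strict ascii rejects any byte above 0x7F.
try:
    raw[0].decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)
```

So latin-1 never raises, but it will silently turn bytes from any other
8-bit encoding into the wrong characters.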
> Someone on irc (I think Freddie Witherden CC'd) had a use case with huge
> ascii tables in numpy which now have to be stored as 4 bytes unicode on
> disk or decode bytes all the time.
and ascii data is not the least bit rare in the science world in
> I personally don't use strings in arrays so I can neither judge the
> impact nor the use, but it seems to me like at least having an ascii
> dtype for python2<->python3 compatibility would be useful.
I think py2<->py3 compatibility is a separate issue -- we should have this
if it's a good thing to have, not because of that. And it is a good thing
And since this is a new thread -- regardless of the decision on this,
loadtxt is broken -- we certainly should be able to parse ascii text and
return something reasonable -- unicode strings would have been fine in the
OP's case, if they didn't have the extra bytes-to-string crap in them.
> The transition towards split string/bytes types in Python 3 has the
> unfortunate side effect of breaking the following snippet:
> np.array("Hello", dtype="|S").item() == "Hello"
Sorry for not testing in py3, but this makes it look like the "S" dtype
still stores one-byte-per-char strings, but .item() returns a bytes object
rather than a unicode (py3 str) object.
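
For reference, a small sketch of what that snippet does, assuming Python 3
and a recent NumPy (the variable name is just for illustration):

```python
import numpy as np

# The "S" dtype stores raw bytes; .item() hands back a Python bytes object.
item = np.array("Hello", dtype="|S").item()

print(type(item))        # bytes, not str, on Python 3
print(item == "Hello")   # bytes and str never compare equal
print(item == b"Hello")  # comparing against a bytes literal works

# An explicit decode is needed to get a py3 str back.
text = item.decode("ascii")
print(text == "Hello")
```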
As in my other note, I think it would be better to have it return a unicode
string by default.
But it looks like you can still use it to store large quantities of ascii
data if you want.
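
A rough sketch of the storage trade-off, assuming Python 3 and a recent
NumPy. The word list is made up; the point is only the per-element size of
"S" versus the UCS-4 based "U" dtype, and the explicit decode step needed
to get str back out.

```python
import numpy as np

words = ["alpha", "beta", "gamma"]

# "S5": five bytes per element, one byte per character.
as_bytes = np.array(words, dtype="S5")
# "U5": five UCS-4 code units per element, four bytes each.
as_unicode = np.array(words, dtype="U5")

print(as_bytes.itemsize, as_unicode.itemsize)
print(as_bytes.nbytes, as_unicode.nbytes)

# Going back to py3 str means decoding, here for the whole array at once.
decoded = np.char.decode(as_bytes, "ascii")
print(decoded.tolist())
```

So the compact storage is still there, but every round trip to str costs a
decode.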
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov