[Numpy-discussion] adding more unicode dtypes

Chris Barker chris.barker at noaa.gov
Wed Jan 15 15:07:35 EST 2014


Julian -- beat me to it!

On Wed, Jan 15, 2014 at 10:25 AM, Julian Taylor <
jtaylor.debian at googlemail.com> wrote:

> On 15.01.2014 18:57, Charles R Harris wrote:
> > There was a discussion of this long ago and UCS-4 was chosen as the
> > numpy standard. There are just too many complications that arise in
> > supporting both.
>

Agreed: supporting both UCS-4 and UCS-2 would be more pain than it's worth.


> In python3 you need extra code to deal with arrays containing strings as
> the S type is interpreted as bytes which is not a string type anymore [0].
>

ouch! I was just assuming that it still was -- yes, I really think we need
a one-byte-per char string type -- probably ascii, but we could do latin-1
and let the buyer beware of the higher value bytes

> Someone on irc (I think Freddie Witherden CC'd) had a use case with huge
> ascii tables in numpy which now have to be stored as 4 bytes unicode on
> disk or decode bytes all the time.
>

and ascii data is not the least bit rare in the science world in
particular.
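
To put a number on the storage cost mentioned above, here is a minimal sketch comparing the one-byte "S" dtype with the UCS-4 "U" dtype. The sample words and the fixed width of 5 are made up for illustration; only the itemsize comparison matters.

```python
import numpy as np

# Hypothetical ASCII-only sample data.
words = ["alpha", "beta", "gamma"]

# "S5": fixed-width byte strings, one byte per character.
as_bytes = np.array(words, dtype="S5")

# "U5": fixed-width unicode strings, stored as UCS-4 (four bytes per character).
as_unicode = np.array(words, dtype="U5")

print(as_bytes.itemsize, as_bytes.nbytes)
print(as_unicode.itemsize, as_unicode.nbytes)
```

For pure ASCII data the "U" array takes four times the memory (and disk space, if saved as-is) of the "S" array.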


> I personally don't use strings in arrays so I can neither judge the
> impact nor the use, but it seems to me like at least having an ascii
> dtype for python2<->python3 compatibility would be useful.
>

I think py2<->py3 compatibility is a separate issue -- we should have this
if it's a good thing to have, not because of that. And it is a good thing
to have.

And since this is a new thread -- regardless of the decision on this,
loadtxt is broken -- we certainly should be able to parse ascii text and
return something reasonable -- unicode strings would have been fine in the
OP's case, if they didn't have the extra bytes-to-string crap in them.


[0] https://github.com/numpy/numpy/issues/4162


from that:

The transition towards split string/bytes types in Python 3 has the
unfortunate side effect of breaking the following snippet:

    np.array("Hello", dtype="|S").item() == "Hello"

Sorry for not testing in py3, but this makes it look like the "S" dtype is
one-byte per char strings, but creates a bytes object, rather than a
unicode (py3 str) object.

As in my other note, I think it would be better to have it return a unicode
string by default.

But it looks like you can still use it to store large quantities of ascii
data if you want.
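
A minimal sketch of what that looks like under Python 3, assuming a reasonably recent numpy. The array contents are made up; the explicit decode step is one possible workaround, not something proposed in the issue.

```python
import numpy as np

# The snippet from the issue: an "S" array hands back bytes, not str.
item = np.array("Hello", dtype="S").item()
print(type(item))
print(item == "Hello")                  # bytes never compare equal to str in py3
print(item.decode("ascii") == "Hello")  # an explicit decode restores the match

# Compact one-byte-per-char storage, decoded only when text is needed.
table = np.array([b"abc", b"de"], dtype="S3")
decoded = np.char.decode(table, "ascii")  # gives a "U" (unicode) array
print(table.dtype, decoded.dtype)
```

So the "S" dtype still works as compact storage, at the price of decoding whenever you want py3 str objects back.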

-Chris






-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov