[Numpy-discussion] Array and string interoperability

Chris Barker chris.barker at noaa.gov
Mon Jun 5 13:40:33 EDT 2017


Just a few notes:

> However, the fact that this works for bytestrings on Python 3 is, in my
> humble opinion, ridiculous:
>
> >>> np.array(b'100', 'u1') # b'100' IS NOT TEXT
> array(100, dtype=uint8)
>

Yes, that is a mis-feature -- I think it is due to bytes and str being the
same type in py2 -- so on py3, numpy continues to treat a bytes object
as a 1-byte-per-char string as well, depending on context. And users want to
be able to write numpy code that runs the same on py2 and py3, so we
kinda need this kind of thing.
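
To make the ambiguity concrete, here is a small sketch of how one can state the intent explicitly on py3 instead of leaving numpy to guess whether a bytes object is text or raw data. The variable names are made up for illustration only.

```python
import numpy as np

raw = b'100'

# Treat the bytes as raw binary data: one uint8 per byte (the ASCII codes).
as_data = np.frombuffer(raw, dtype='u1')

# Treat the bytes as text holding a number: decode and convert explicitly.
as_number = np.array(int(raw.decode('ascii')), dtype='u1')

print(as_data.tolist())  # [49, 48, 48]
print(int(as_number))    # 100
```

Both readings are legitimate; the point is only that the caller picks one, rather than the dtype argument silently deciding.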

Makes me think that an optional "pure-py-3" mode for numpy might be a good
idea. If that flag is set, your code will only run on py3 (or at least
might run differently).


> > Further thoughts:
> > If trying to create "u1" array from a Python 3 string, question is,
> > whether it should throw an error, I think yes,


well, you can pass numbers > 255 into a u1 already:

In [96]: np.array(456, dtype='u1')
Out[96]: array(200, dtype=uint8)

and it does the wrap-around overflow thing... so why not?
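
For reference, a minimal sketch of the same wrap-around done through an explicit cast. As far as I know, more recent NumPy releases warn about or reject an out-of-range Python int passed directly to the constructor, so this goes through astype(), which does the C-style modulo-256 conversion. The sample values are arbitrary.

```python
import numpy as np

# Values outside the uint8 range 0..255, stored first in a wider integer type.
big = np.array([456, 255, 256, -1])

# An explicit cast wraps modulo 256 instead of raising.
wrapped = big.astype('u1')

print(wrapped.tolist())  # [200, 255, 0, 255]
```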


> and in this case
> > "u4" type should be explicitly specified by initialisation, I suppose.
> > And e.g. translation from unicode to extended ascii (Latin1) or whatever
> > should be done on Python side  or with explicit translation.
>

absolutely!

> If you ask me, passing a unicode string to fromstring with sep='' (i.e.
> to parse binary data) should ALWAYS raise an error: the semantics only
> make sense for strings of bytes.
>

exactly -- we really should have a "frombytes()" alias for fromstring() and
it should only work for actual bytes objects (strings on py2, naturally).

and overloading fromstring() to mean both "binary dump of data" and "parse
the text", depending on whether the sep argument is set, was always a bad
idea :-(

... and fromstring(s, sep=a_sep_char) has been semi-broken (or at least not
robust) forever anyway.
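
A rough sketch of what splitting the two jobs could look like on py3. Both frombytes() and parse_text() here are hypothetical helpers written for this example, not existing numpy functions; the binary one is built on np.frombuffer(), which does exist.

```python
import numpy as np

def frombytes(data, dtype=float):
    """Hypothetical helper: interpret a bytes-like object as raw array data."""
    if isinstance(data, str):
        raise TypeError("frombytes() needs a bytes-like object, not str")
    # frombuffer returns a read-only view of the buffer; copy so the
    # result is an independent, writable array.
    return np.frombuffer(data, dtype=dtype).copy()

def parse_text(text, sep, dtype=float):
    """Hypothetical helper: parse separator-delimited numbers from text."""
    if isinstance(text, bytes):
        text = text.decode('ascii')
    # Skip empty tokens so a trailing separator is tolerated.
    values = [float(tok) for tok in text.split(sep) if tok.strip()]
    return np.array(values, dtype=dtype)

print(frombytes(b'\x01\x02\x03', dtype='u1').tolist())  # [1, 2, 3]
print(parse_text('1, 2.5, 3,', sep=',').tolist())       # [1.0, 2.5, 3.0]
```

With the two entry points separate, passing a str to the binary one fails loudly instead of quietly reinterpreting text as data.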

> Currently, there appears to be some UTF-8 conversion going on, which
> creates potentially unexpected results:
>
> >>> s = 'αβγδ'
> >>> a = np.fromstring(s, 'u1')
> >>> a
> array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
> >>> assert len(a) * a.dtype.itemsize  == len(s)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AssertionError
> >>>
>
> This is, apparently (https://github.com/numpy/numpy/issues/2152), due to
> how the internals of Python deal with unicode strings in C code, and not
> due to anything numpy is doing.
>

exactly -- py3 strings are a pretty nifty implementation of unicode text --
they have nothing to do with storing binary data, and should not be used
that way. There is essentially no reason you would ever want to pass their
internal binary representation to any other code.
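
To illustrate, a short sketch of getting numbers out of a py3 string by saying explicitly what you want: either the bytes of a named encoding, or the code points. Nothing here relies on how the interpreter stores the string internally.

```python
import numpy as np

s = 'αβγδ'

# Bytes of an explicitly chosen encoding: two bytes per Greek letter in UTF-8.
utf8 = np.frombuffer(s.encode('utf-8'), dtype='u1')

# Unicode code points, one per character, stored as 4-byte unsigned ints.
codepoints = np.array([ord(ch) for ch in s], dtype='u4')

print(len(s))               # 4
print(utf8.tolist())        # [206, 177, 206, 178, 206, 179, 206, 180]
print(codepoints.tolist())  # [945, 946, 947, 948]
```

The 8-versus-4 length mismatch from the quoted session is then just the visible consequence of having chosen UTF-8, not a surprise.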

fromstring should be re-named frombytes, and it should raise an exception
if you pass something other than a bytes object (or maybe a memoryview or
other binary container?)

we might want to keep fromstring() for parsing strings, but only if it were
fixed...

> IMHO calling fromstring(..., sep='') with a unicode string should be
> deprecated and perhaps eventually forbidden. (Or fixed, but that would
> break backwards compatibility)


agreed.

> > Python3 assumes 4-byte strings but in reality most of the time
> > we deal with 1-byte strings, so there is huge waste of resources
> > when dealing with 4-bytes. For many serious projects it is just not
> > needed.
>
> That's quite enough anglo-centrism, thank you. For when you need byte
> strings, Python 3 has a type for that. For when your strings contain
> text, bytes with no information on encoding are not enough.
>

There was a big thread about this recently -- it seems to have not quite
come to a conclusion. But anglo-centrism aside, there is substantial demand
for a "smaller" way to store mostly-ascii text.

I _think_ the conversation was steering toward an encoding-specified string
dtype, so us anglo-centric folks could use latin-1 or utf-8.

But someone would need to write the code.
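
In the meantime, a sketch of the size difference with what numpy has today: the unicode dtype uses 4 bytes per character, while encoding to a bytes dtype with a 1-byte encoding such as latin-1 uses 1. The sample words are arbitrary, and the encoding is not recorded in the dtype, so the caller has to remember it, which is the gap an encoding-specified dtype would close.

```python
import numpy as np

words = np.array(['spam', 'eggs', 'ham'])   # unicode dtype, 4 bytes per char
packed = np.char.encode(words, 'latin-1')   # bytes dtype, 1 byte per char

print(words.dtype.itemsize)   # 16
print(packed.dtype.itemsize)  # 4

# Decoding has to name the same encoding again to get the text back.
restored = np.char.decode(packed, 'latin-1')
print(restored.tolist())      # ['spam', 'eggs', 'ham']
```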

-CHB

> > There can be some convenience methods for ascii operations,
> > like eg char.toupper(), but currently they don't seem to work with
> > integer arrays so why not make those potentially useful methods usable
> > and make them work on normal integer arrays?
> I don't know what you're doing, but I don't think numpy is normally the
> right tool for text manipulation...
>

I agree here. But if one were to add such a thing (vectorized string
operations) -- I'd think the thing to do would be to wrap (or port) the
python string methods. But it should only work for actual string dtypes, of
course.

note that another part of the discussion previously suggested that we have
a dtype that wraps a native python string object -- then you'd get all of
that for free. This is essentially an object array with strings in it, which
you can do now.
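
A minimal sketch of both routes as they exist today: an object array holding ordinary python str objects, where you call the str methods yourself, and a fixed-width string array going through the np.char wrappers. The sample data is arbitrary.

```python
import numpy as np

# Object array: each element is a regular python str, so str methods apply.
obj = np.array(['spam', 'Eggs'], dtype=object)
upper_obj = np.array([s.upper() for s in obj], dtype=object)

# Fixed-width unicode array: np.char wraps the python string methods.
fixed = np.array(['spam', 'Eggs'])
upper_fixed = np.char.upper(fixed)

print(upper_obj.tolist())    # ['SPAM', 'EGGS']
print(upper_fixed.tolist())  # ['SPAM', 'EGGS']
```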

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov