[Numpy-discussion] Array and string interoperability

Thomas Jollans tjol at tjol.eu
Mon Jun 5 16:51:52 EDT 2017


On 05/06/17 19:40, Chris Barker wrote:
>
>     If you ask me, passing a unicode string to fromstring with sep=''
>     (i.e.
>     to parse binary data) should ALWAYS raise an error: the semantics only
>     make sense for strings of bytes.
>
>
> exactly -- we really should have a "frombytes()" alias for
> fromstring() and it should only work for actual bytes objects (strings
> on py2, naturally).
>
> and overloading fromstring() to mean both "binary dump of data" and
> "parse the text" due to whether the sep argument is set was always a
> bad idea :-(
>
> .. and fromstring(s, sep=a_sep_char)

As it happens, this is pretty much what the stdlib array module has done
since 3.2 (http://bugs.python.org/issue8990)
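
For illustration, a minimal sketch of the stdlib behaviour I have in mind,
assuming the `array` module change tracked in that issue (`frombytes()` as
the bytes-only spelling). The variable names are mine:

```python
from array import array

a = array('B')                    # array of unsigned bytes
a.frombytes(b'\x01\x02\x03')      # bytes objects are accepted
a.frombytes(bytearray(b'\x04'))   # so are other bytes-like objects

try:
    a.frombytes('abc')            # text (str) is rejected on Python 3
except TypeError:
    text_rejected = True
else:
    text_rejected = False
```

The point is only that the binary entry point refuses text outright instead
of guessing an encoding.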


>  
> has been semi broken (or at least not robust) forever anyway.
>
>     Currently, there appears to be some UTF-8 conversion going on, which
>     creates potentially unexpected results:
>
>     >>> s = 'αβγδ'
>     >>> a = np.fromstring(s, 'u1')
>     >>> a
>     array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
>     >>> assert len(a) * a.dtype.itemsize  == len(s)
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     AssertionError
>     >>>
>
>     This is, apparently (https://github.com/numpy/numpy/issues/2152), due to
>     how the internals of Python deal with unicode strings in C code,
>     and not
>     due to anything numpy is doing.
>
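
A sketch of the explicit alternative, assuming `np.frombuffer` as the binary
entry point (my choice for illustration, not something proposed above):
encode the text yourself, so the byte count is whatever the chosen encoding
says it is rather than an accident of the interpreter's internals.

```python
import numpy as np

s = 'αβγδ'

# Choose the encoding explicitly; each of these characters is 2 bytes in UTF-8.
raw = s.encode('utf-8')
a = np.frombuffer(raw, dtype='u1')

# The invariant now holds against the encoded bytes, not the str length.
assert len(a) * a.dtype.itemsize == len(raw)

# If code points are wanted instead, a fixed-width encoding gives one per item.
codepoints = np.frombuffer(s.encode('utf-32-le'), dtype='<u4')
```
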
>
> exactly -- py3 strings are a pretty nifty implementation of unicode text
> -- they have nothing to do with storing binary data, and should not be
> used that way. There is essentially no reason you would ever want to
> pass the actual binary representation to any other code.
>
> fromstring should be re-named frombytes, and it should raise an
> exception if you pass something other than a bytes object (or maybe a
> memoryview or other binary container?)
>
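
To make the suggestion concrete, here is a rough sketch of such a function.
Note that `frombytes` below is a hypothetical helper written for this mail,
not an existing numpy function; it simply wraps `np.frombuffer` and refuses
text.

```python
import numpy as np

def frombytes(data, dtype=float):
    """Hypothetical helper: interpret a bytes-like object as a 1-D array."""
    if isinstance(data, str):
        # Text has no defined binary layout; make the caller encode it first.
        raise TypeError("frombytes() requires a bytes-like object, not str")
    # memoryview() raises TypeError for anything without the buffer protocol.
    return np.frombuffer(memoryview(data), dtype=dtype)

from_bytes = frombytes(b'\x01\x02', dtype='u1')
from_bytearray = frombytes(bytearray(b'\x03\x04'), dtype='u1')

try:
    frombytes('abc', dtype='u1')
except TypeError:
    str_rejected = True
else:
    str_rejected = False
```

Accepting anything that supports the buffer protocol covers the memoryview
case mentioned above without listing the allowed types one by one.
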
> we might want to keep fromstring() for parsing strings, but only if it
> were fixed...
>
>     IMHO calling fromstring(..., sep='') with a unicode string should be
>     deprecated and perhaps eventually forbidden. (Or fixed, but that would
>     break backwards compatibility)
>
>
> agreed.
>
>     > Python3 assumes 4-byte strings but in reality most of the time
>     > we deal with 1-byte strings, so there is huge waste of resources
>     > when dealing with 4-bytes. For many serious projects it is just
>     not needed.
>
>     That's quite enough anglo-centrism, thank you. For when you need byte
>     strings, Python 3 has a type for that. For when your strings contain
>     text, bytes with no information on encoding are not enough.
>
>
> There was a big thread about this recently -- it seems to have not
> quite come to a conclusion. But anglo-centrism aside, there is
> substantial demand for a "smaller" way to store mostly-ascii text.
>
> I _think_ the conversation was steering toward an encoding-specified
> string dtype, so us anglo-centric folks could use latin-1 or utf-8.
>
> But someone would need to write the code.
>
> -CHB
>
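
Until someone writes that dtype, something close can be approximated by
hand. A sketch, assuming the text really does fit in latin-1: store the
encoded bytes in an `S` array and decode on the way out. The sample data is
made up for illustration.

```python
import numpy as np

text = np.array(['café', 'abcd'])          # unicode dtype, 4 bytes per character
packed = np.char.encode(text, 'latin-1')   # bytes dtype, 1 byte per character

# Decoding needs the encoding supplied again; the array does not remember it.
roundtrip = np.char.decode(packed, 'latin-1')

sizes = (text.itemsize, packed.itemsize)   # bytes per element in each array
```

The drawback is the one raised earlier in the thread: the `S` array carries
no record of its encoding, so every consumer has to know it out of band.
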
>     > There can be some convenience methods for ascii operations,
>     > like eg char.toupper(), but currently they don't seem to work
>     with integer
>     > arrays so why not make those potentially useful methods usable
>     > and make them work on normal integer arrays?
>     I don't know what you're doing, but I don't think numpy is
>     normally the
>     right tool for text manipulation...
>
>
> I agree here. But if one were to add such a thing (vectorized string
> operations) -- I'd think the thing to do would be to wrap (or port)
> the python string methods. But it should only work for actual string
> dtypes, of course.
>
> note that another part of the discussion previously suggested that we
> have a dtype that wraps a native python string object -- then you'd
> get it all for free. This is essentially an object array with strings in
> it, which you can do now.
>
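
Both routes can be tried today. A small sketch, assuming `np.char.upper` for
the string-dtype case and `np.frompyfunc` as one way to apply a str method
over an object array (other ways exist; the sample data is made up):

```python
import numpy as np

# String dtype: np.char wraps the corresponding Python string methods.
fixed = np.array(['abc', 'de'])
upper_fixed = np.char.upper(fixed)

# Object array holding ordinary Python str objects.
objs = np.array(['abc', 'de'], dtype=object)
# frompyfunc calls str.upper on each element and returns an object array.
upper_objs = np.frompyfunc(str.upper, 1, 1)(objs)
```

Neither is vectorized in the sense of avoiding a Python-level call per
element in the object case, so this is about convenience, not speed.
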
> -CHB
>
>
> -- 
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion


-- 
Thomas Jollans

m ☎ +31 6 42630259
e ✉ tjol at tjol.eu


