[Numpy-discussion] Array and string interoperability
tjol at tjol.eu
Mon Jun 5 16:51:52 EDT 2017
On 05/06/17 19:40, Chris Barker wrote:
> If you ask me, passing a unicode string to fromstring with sep=''
> (i.e. to parse binary data) should ALWAYS raise an error: the semantics only
> make sense for strings of bytes.
> exactly -- we really should have a "frombytes()" alias for
> fromstring() and it should only work for actual bytes objects (strings
> on py2, naturally).
> and overloading fromstring() to mean both "binary dump of data" and
> "parse the text" due to whether the sep argument is set was always a
> bad idea :-(
> .. and fromstring(s, sep=a_sep_char)
> has been semi broken (or at least not robust) forever anyway.
As it happens, this is pretty much what stdlib bytearray does since 3.2
> Currently, there appears to be some UTF-8 conversion going on, which
> creates potentially unexpected results:
> >>> s = 'αβγδ'
> >>> a = np.fromstring(s, 'u1')
> >>> a
> array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
> >>> assert len(a) * a.dtype.itemsize == len(s)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AssertionError
> This is, apparently (https://github.com/numpy/numpy/issues/2152), due to
> how the internals of Python deal with unicode strings in C code, and not
> due to anything numpy is doing.
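To make the point above concrete, here is a minimal sketch (not part of the original exchange) that does the text-to-bytes step explicitly and then reads the bytes with np.frombuffer. The variable names are made up for illustration, and UTF-8 is assumed only because that is the encoding the output above suggests.

```python
import numpy as np

s = 'αβγδ'

# Encode explicitly, so the choice of encoding is visible in the code
# instead of being an accident of the string's internal representation.
encoded = s.encode('utf-8')

# frombuffer reads raw bytes; each Greek letter takes two bytes in UTF-8.
a = np.frombuffer(encoded, dtype='u1')

print(len(s), len(encoded), a.size)  # 4 8 8
```

The array length matches len(encoded), not len(s), which is exactly why the assertion in the quoted session fails.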
> exactly -- py3 strings are a pretty nifty implementation of unicode text
> -- they have nothing to do with storing binary data, and should not be
> used that way. There is essentially no reason you would ever want to
> pass the actual binary representation to any other code.
> fromstring should be re-named frombytes, and it should raise an
> exception if you pass something other than a bytes object (or maybe a
> memoryview or other binary container?)
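A rough sketch of what such a function might look like, purely as an illustration: the name frombytes and its signature here are hypothetical (this is not an existing numpy function), and it simply leans on np.frombuffer, which already accepts bytes, bytearray and memoryview.

```python
import numpy as np

def frombytes(data, dtype=float):
    """Hypothetical helper: build an array from a binary buffer only."""
    # Reject text outright; str has no defined binary meaning here.
    if isinstance(data, str):
        raise TypeError("frombytes() requires a bytes-like object, not str")
    # frombuffer returns a view onto the buffer (read-only for bytes),
    # so copy to get an independent, writable array.
    return np.frombuffer(data, dtype=dtype).copy()

arr = frombytes(b'\x01\x02\x03', dtype='u1')
view_arr = frombytes(memoryview(bytearray(b'\x04\x05')), dtype='u1')

try:
    frombytes('abc', dtype='u1')
    str_rejected = False
except TypeError:
    str_rejected = True
```

Whether to copy or return a view is a design choice; the copy is used here only so the result does not depend on the lifetime or mutability of the input buffer.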
> we might want to keep fromstring() for parsing strings, but only if it
> were fixed...
> IMHO calling fromstring(..., sep='') with a unicode string should be
> deprecated and perhaps eventually forbidden. (Or fixed, but that would
> break backwards compatibility)
> > Python3 assumes 4-byte strings but in reality most of the time
> > we deal with 1-byte strings, so there is huge waste of resources
> > when dealing with 4-bytes. For many serious projects it is just
> > not needed.
> That's quite enough anglo-centrism, thank you. For when you need byte
> strings, Python 3 has a type for that. For when your strings contain
> text, bytes with no information on encoding are not enough.
> There was a big thread about this recently -- it seems to have not
> quite come to a conclusion. But anglo-centrism aside, there is
> substantial demand for a "smaller" way to store mostly-ascii text.
> I _think_ the conversation was steering toward an encoding-specified
> string dtype, so us anglo-centric folks could use latin-1 or utf-8.
> But someone would need to write the code.
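For a sense of the size difference being discussed, here is a small sketch using what numpy already offers: the unicode ('U') dtype stores 4 bytes per character, while a bytes ('S') dtype stores 1. The latin-1 choice and the sample words are assumptions for illustration; this is not the proposed encoding-aware dtype, just the manual workaround available today.

```python
import numpy as np

u = np.array(['abc', 'de'])           # unicode dtype, 3 characters wide
b = np.char.encode(u, 'latin-1')      # bytes dtype, 3 bytes wide

print(u.dtype.itemsize, b.dtype.itemsize)  # 12 3

# The encoding is not stored with the array, so it has to be supplied
# again by hand to get text back.
roundtrip = np.char.decode(b, 'latin-1')
```

The need to remember 'latin-1' separately from the data is the gap an encoding-specified dtype would close.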
> > There can be some convenience methods for ascii operations,
> > like eg char.toupper(), but currently they don't seem to work
> > with integer arrays so why not make those potentially useful
> > methods usable and make them work on normal integer arrays?
> I don't know what you're doing, but I don't think numpy is normally the
> right tool for text manipulation...
> I agree here. But if one were to add such a thing (vectorized string
> operations) -- I'd think the thing to do would be to wrap (or port)
> the python string methods. But it should only work for actual string
> dtypes, of course.
> note that another part of the discussion previously suggested that we
> have a dtype that wraps a native python string object -- then you'd
> get all for free. This is essentially an object array with strings in
> it, which you can do now.
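Both options mentioned above can be tried with current numpy. The sketch below is only an illustration with made-up sample data: np.char.upper wraps the Python string method for real string dtypes, while an object array simply holds Python str objects, so their methods are called element by element.

```python
import numpy as np

# Fixed-width string dtype: numpy's wrapped string methods apply.
fixed = np.array(['alpha', 'beta'])
fixed_upper = np.char.upper(fixed)

# Object array of Python strings: each element is a full str object,
# so any str method is available, at the cost of a Python-level loop.
objs = np.array(['alpha', 'beta'], dtype=object)
objs_upper = np.array([w.upper() for w in objs], dtype=object)

print(fixed_upper.tolist(), objs_upper.tolist())
```

The object-array route gets "all for free" in terms of methods, but not in terms of speed or memory layout.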
> Christopher Barker, Ph.D.
> Emergency Response Division
> NOAA/NOS/OR&R (206) 526-6959 voice
> 7600 Sand Point Way NE (206) 526-6329 fax
> Seattle, WA 98115 (206) 526-6317 main reception
> Chris.Barker at noaa.gov <mailto:Chris.Barker at noaa.gov>
m ☎ +31 6 42630259
e ✉ tjol at tjol.eu