[Numpy-discussion] Array and string interoperability

Thomas Jollans tjol at tjol.eu
Sun Jun 4 17:59:50 EDT 2017


On 04/06/17 20:04, Mikhail V wrote:
> Initializing an array from a string currently looks like:
>
> s= "012 abc"
> A= fromstring(s,"u1")
> print A ->
> [48 49 50 32 97 98 99]
>
> Perfect.
> Now when writing values it will not work
> as IMO it should, namely consider this example:
>
> B= zeros(7,"u1")
> B[0]=s[1]
> print B ->
> [1 0 0 0 0 0 0]
>
> Ugh? It tries to parse the s[1] character "1" as integer and writes 1 to B[0].
> First thing I would expect is a value error, and I'd never expect it to do
> such high-level manipulations with parsing.
> IMO ideally it would do the following instead:
>
> B[0]=s[1]
> print B ->
> [49  0  0  0  0  0  0]
>
> So it should just write ord(s[1]) to B.
> Sounds logical? For me very much.
> Further, one could write like this:
>
> B[:] = s
> print B->
> [48 49 50 32 97 98 99]
>
> Namely cast the string into byte array. IMO this would be
> the logical expected behavior.

I disagree. If numpy treated bytestrings as sequences of uint8s (which
would, granted, be perfectly reasonable, at least in py3), you wouldn't
have needed the fromstring function in the first place. Personally, I
think I would prefer this, actually. However, numpy normally treats
strings as objects that can sometimes be cast to numbers, so this
behaviour is perfectly logical.

For what it's worth, in Python 3 (which you should probably be
using), everything behaves as you'd expect:

>>> import numpy as np
>>> s = b'012 abc'
>>> a = np.fromstring(s, 'u1')
>>> a
array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)
>>> b = np.zeros(7, 'u1')
>>> b[0] = s[1]
>>> b
array([49,  0,  0,  0,  0,  0,  0], dtype=uint8)
>>>
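
The binary mode of fromstring has since been deprecated in favour of
frombuffer, so the same round trip would probably be spelled roughly as
below today. This is only a sketch under that assumption; the variable
names are illustrative and not from the original exchange.

```python
import numpy as np

s = b"012 abc"

# frombuffer reinterprets the bytes as uint8 without parsing anything.
# The result is a read-only view onto the bytes object.
a = np.frombuffer(s, dtype=np.uint8)

b = np.zeros(7, dtype=np.uint8)
b[0] = s[1]          # indexing bytes in Python 3 yields the int 49, not "1"
first = int(b[0])

b[:] = a             # copy the raw byte values into the writable array
print(first, b.tolist())
```

The last assignment is one way to get the effect of the proposed
B[:] = s without changing how numpy treats strings.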

> Currently it just throws a value error if it meets non-digits in a string,
> so IMO current casting hardly can be of practical use.
>
> Furthermore, I think this code:
>
> A= array(s,"u1")
>
> Could act exactly same as:
>
> A= fromstring(s,"u1")
>
> But this is just a side-idea for spelling simplicity/generality.
> Not really necessary.

There is also something to be said for the current behaviour:

>>> np.array('100', 'u1')
array(100, dtype=uint8)

However, the fact that this works for bytestrings on Python 3 is, in my
humble opinion, ridiculous:

>>> np.array(b'100', 'u1') # b'100' IS NOT TEXT
array(100, dtype=uint8)

This is of course consistent with the fact that you can cast a
bytestring to builtin python int or float (but not complex).
Interestingly enough, numpy complex behaves differently from python complex:

>>> complex(b'1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: complex() argument must be a string or a number, not 'bytes'
>>> complex('1')
(1+0j)
>>> np.complex128('1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: a float is required
>>>
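
For reference, the builtin behaviour mentioned above can be checked with
a few lines of plain Python 3. This is a minimal sketch; the variable
names are mine.

```python
# Python 3 builtins: int and float accept ASCII bytes, complex does not.
as_int = int(b"100")
as_float = float(b"1.5")

try:
    complex(b"1")
    complex_accepts_bytes = True
except TypeError:
    complex_accepts_bytes = False

print(as_int, as_float, complex_accepts_bytes)
```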


> Further thoughts:
> If trying to create "u1" array from a Python 3 string, question is,
> whether it should throw an error, I think yes, and in this case
> "u4" type should be explicitly specified by initialisation, I suppose.
> And e.g. translation from unicode to extended ascii (Latin1) or whatever
> should be done on Python side  or with explicit translation.

If you ask me, passing a unicode string to fromstring with sep='' (i.e.
to parse binary data) should ALWAYS raise an error: the semantics only
make sense for strings of bytes.

Currently, there appears to be some UTF-8 conversion going on, which
creates potentially unexpected results:

>>> s = 'αβγδ'
>>> a = np.fromstring(s, 'u1')
>>> a
array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
>>> assert len(a) * a.dtype.itemsize  == len(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>>

This is, apparently (https://github.com/numpy/numpy/issues/2152), due to
how the internals of Python deal with unicode strings in C code, and not
due to anything numpy is doing.
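
If UTF-8 bytes are actually what you want, doing the encoding yourself
makes the length mismatch unsurprising. A small sketch, assuming
frombuffer as the way to view the resulting bytes:

```python
import numpy as np

s = "αβγδ"
encoded = s.encode("utf-8")      # 4 characters become 8 bytes
a = np.frombuffer(encoded, dtype=np.uint8)

print(len(s), len(encoded), a.tolist())
```

Here the array length matches len(encoded), not len(s), and the choice
of encoding is visible in the code.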

Speaking of unexpected results, I'm not sure you realize what fromstring
does when you give it a multi-byte dtype:

>>> s = 'αβγδ'
>>> a = np.fromstring(s, 'u4')
>>> a
array([2999890382, 3033445326], dtype=uint32)
>>>
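
Those two numbers are simply the eight UTF-8 bytes regrouped into
32-bit words. The sketch below reproduces them with an explicit
little-endian dtype (an assumption: the session above presumably ran on
a little-endian machine with native 'u4').

```python
import numpy as np

raw = "αβγδ".encode("utf-8")             # 8 bytes: CE B1 CE B2 CE B3 CE B4
words = np.frombuffer(raw, dtype="<u4")  # two little-endian 32-bit words

# The first word is just the first four UTF-8 bytes glued together.
first = int.from_bytes(raw[:4], "little")
print(words.tolist(), first)
```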

Give fromstring() a numpy unicode string, and all is right with the world:

>>> s = np.array('αβγδ')
>>> s
array('αβγδ',
      dtype='<U4')
>>> np.fromstring(s, 'u4')
array([945, 946, 947, 948], dtype=uint32)
>>>
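
The same code points can be obtained from a plain Python string by
choosing a fixed-width encoding explicitly. This is a sketch assuming
UTF-32 little-endian, which uses four bytes per code point and matches
the '<U4' layout shown above:

```python
import numpy as np

s = "αβγδ"

# "utf-32-le" writes each code point as 4 little-endian bytes, no BOM.
codepoints = np.frombuffer(s.encode("utf-32-le"), dtype="<u4")
print(codepoints.tolist())
```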


IMHO calling fromstring(..., sep='') with a unicode string should be
deprecated and perhaps eventually forbidden. (Or fixed, but that would
break backwards compatibility)

> Python3 assumes 4-byte strings but in reality most of the time
> we deal with 1-byte strings, so there is huge waste of resources
> when dealing with 4-bytes. For many serious projects it is just not needed.

That's quite enough anglo-centrism, thank you. For when you need byte
strings, Python 3 has a type for that. For when your strings contain
text, bytes with no information on encoding are not enough.

> Furthermore I think some of the methods from "chararray" submodule
> should be possible to use directly on normal integer arrays without
> conversions to other array types.
> So I personally don't really get why the need of additional chararray type,
> It's all numbers anyway and it's up to the programmer to
> decide what size of translation tables/value ranges he wants to use.

chararray is deprecated.

> There can be some convenience methods for ascii operations,
> like eg char.toupper(), but currently they don't seem to work with integer
> arrays so why not make those potentially useful methods usable
> and make them work on normal integer arrays?

I don't know what you're doing, but I don't think numpy is normally the
right tool for text manipulation...
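
That said, if the data really is ASCII held in a uint8 array, ordinary
integer arithmetic already covers something like upper-casing. The
following is only an illustrative sketch with names of my own choosing,
not an existing numpy text API:

```python
import numpy as np

s = b"012 abc"
a = np.frombuffer(s, dtype=np.uint8).copy()  # copy: frombuffer views are read-only

# Select the bytes that fall in the ASCII range 'a'..'z'.
lower = (a >= ord("a")) & (a <= ord("z"))
a[lower] -= 32       # 'a'..'z' and 'A'..'Z' differ by 32 in ASCII

result = a.tobytes()
print(result)
```

For plain bytes, s.upper() gives the same result without numpy at all.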

> [snip]
>
> as a side-note, I don't think that encoding should be assumed much for
> creating new array types, it is up to the programmer
> to decide what 'meanings' the bytes have.

Agreed!



-- Thomas


