[Numpy-discussion] Array and string interoperability

Mikhail V mikhailwas at gmail.com
Mon Jun 5 18:59:27 EDT 2017

On 4 June 2017 at 23:59, Thomas Jollans <tjol at tjol.eu> wrote:
> For what it's worth, in Python 3 (which you probably should want to be
> using), everything behaves as you'd expect:
>>>> import numpy as np
>>>> s = b'012 abc'
>>>> a = np.fromstring(s, 'u1')
>>>> a
> array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)
>>>> b = np.zeros(7, 'u1')
>>>> b[0] = s[1]
>>>> b
> array([49,  0,  0,  0,  0,  0,  0], dtype=uint8)

OK, examples work best.
I think we have to separate the cases, though,
so I will give examples in a recent Python 3 from now on to avoid confusion.
Case divisions:

-- classify by "forward/backward" conversion:
    For now consider only forward conversion, i.e. copying data from a
string to a numpy array.

-- classify by "bytes vs ordinals":

a)  bytes:  If I need raw bytes, then e.g.

  B = bytes(s.encode())

will do it, and then I can copy the data to an array. So there are already
methods covering this. If I understand correctly, the extracted data
corresponds to a utf-?? byte feed, i.e. a non-constant byte length per
char (1 up to 4 bytes per char for the 'wide' unicode, correct me if I
am wrong).
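To make case a) concrete, here is a small sketch. It relies on the default
codec of str.encode() in Python 3, which is UTF-8, and it uses np.frombuffer
(my choice for the illustration, not something taken from the messages above)
to view the bytes as a "u1" array. The sample string is only an example.

```python
import numpy as np

s = "012 АБВ"                 # 4 ascii chars + 3 Cyrillic chars
B = s.encode()                # str.encode() defaults to UTF-8 in Python 3

# In UTF-8 each ascii char takes 1 byte, each of these Cyrillic chars 2 bytes
n_chars = len(s)              # number of characters
n_bytes = len(B)              # number of encoded bytes

# View the raw bytes as unsigned 8-bit integers (read-only view, no copy)
raw = np.frombuffer(B, dtype="u1")
print(n_chars, n_bytes, raw)
```

So the byte count and the character count differ as soon as the string
leaves basic ascii.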

b)  ordinals:  I need *ordinals*.
  Yes, I need ordinals. With the bytes() method, if a Python 3 string
  contains only basic ascii, I can convert it to bytes and then to an
  integer array anyway, and the length will be the same: 1 byte for
  each char.
  Syntactically, though, and with slicing, this will look e.g. like:

import numpy as np

s = "012 abc"
B = bytes(s.encode())  # convert to bytes
k = len(s)
arr = np.zeros(k, "u1")   # init zero-filled array of length k
arr[0:2] = list(B[0:2])   # copy the first two byte values
print("my array: ", arr)
my array:  [48 49  0  0  0  0  0]

The result seems correct. Note that I also need to use list(B), otherwise
the slice assignment does not work (it fills both values with 1, no idea
where the 1 comes from).
Or I can write e.g.:
arr[0:2] = np.fromstring(B[0:2], "u1")

But this indeed looks like a 'hack' and is not so simple.
Considering your other examples, there is another (better?) way, see below.
Note that I personally don't know the best practices and many of the
technical nuances here, so I am repeating it from your words.
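For what it's worth, the slice copy in the ascii-only case can also be
sketched with np.frombuffer, which reads a bytes object directly as "u1"
values. Using it in place of np.fromstring is my assumption here, not
something proposed in the thread; the variable names follow the example
above.

```python
import numpy as np

s = "012 abc"
B = s.encode("ascii")             # raises UnicodeEncodeError for non-ascii input
arr = np.zeros(len(s), dtype="u1")

# frombuffer interprets the two bytes as two uint8 values
arr[0:2] = np.frombuffer(B[0:2], dtype="u1")
print("my array: ", arr)
```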

-- classify by "what is the maximal ordinal value in the string":
Well, say I don't know what the maximal ordinal is, e.g. here I take
3 Cyrillic letters instead of 'abc':

import numpy as np

s = "012 АБВ"
k = len(s)
arr = np.zeros(k, "u4")   # init zero-filled 32-bit array of length k
arr[:] = np.fromstring(np.array(s), "u4")
print(arr)
[  48   49   50   32 1040 1041 1042]

This indeed gives correct results, so I get my ordinals as expected.
So this is the better/preferred way, right?
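Two other ways to get the same ordinals, as a sketch only: one goes through
ord() for every character, the other encodes to UTF-32 and reads the buffer
as 32-bit integers. The codec name "utf-32-le" and the "<u4" dtype are my
assumptions, chosen to keep the byte order explicit and to avoid the byte
order mark.

```python
import numpy as np

s = "012 АБВ"

# Variant 1: code point of each character via ord()
a1 = np.array([ord(c) for c in s], dtype="u4")

# Variant 2: fixed 4 bytes per char; "utf-32-le" writes no byte order mark
a2 = np.frombuffer(s.encode("utf-32-le"), dtype="<u4")

print(a1)
print(a2)
```

Both variants work regardless of the maximal ordinal in the string, since
every code point fits into 32 bits.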

Just some further thoughts on the topic:
I would like to do the above things with a simpler syntax.
For example, if there were methods taking Python strings:

arr = np.ordinals(s)
arr[0:2] = np.ordinals(s[0:2])  # with slicing

or, e.g., in a form like this:

arr = np.copystr(s)
arr[0:2] = np.copystr(s[0:2])

which would give me the same result as what you proposed:

arr = np.fromstring(np.array(s),"u4")
arr[0:2] = np.fromstring(np.array(s[0:2]),"u4")
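Just to illustrate the idea, such a function could be sketched in plain
Python on top of existing NumPy calls. Note that ordinals() below is a
hypothetical helper and not part of NumPy; it always returns a "u4" array,
and any cast to a narrower dtype is left to the assignment.

```python
import numpy as np

def ordinals(s):
    """Hypothetical helper (not a NumPy function): code points of s as uint32."""
    return np.fromiter(map(ord, s), dtype="u4", count=len(s))

s = "012 АБВ"
arr = ordinals(s)

# Slice assignment works the same way, with no dtype argument at the call site
arr2 = np.zeros(len(s), dtype="u4")
arr2[0:2] = ordinals(s[0:2])
print(arr, arr2)
```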

In other words, omitting the "u4" parameter seems to be OK. E.g.
if there is a "u1" array on the left side of the assignment, the values
would be silently wrapped(?) according to the Numpy rules (as Chris pointed
out). And the backward conversion to a Python string would work in a
similar way.
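A small sketch of both points, assuming the usual NumPy casting behaviour
for array-to-array assignment: values that don't fit into "u1" wrap modulo
256, and the backward conversion can be done with chr(). The variable names
are only for illustration.

```python
import numpy as np

s = "012 АБВ"
src = np.array([ord(c) for c in s], dtype="u4")

# Assigning a "u4" array into a "u1" array casts C-style: 1040 % 256 == 16
dst = np.zeros(len(s), dtype="u1")
dst[:] = src
print(dst)

# Backward conversion: ordinals back to a Python string
back = "".join(chr(int(v)) for v in src)
print(back)
```

Note that the wrapped "u1" values no longer map back to the original
Cyrillic letters, so the round trip only works from the "u4" array.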

Though for Python 2 it could raise the question of why casting to "u4" is
needed.

It would be cool to just use = without any methods, as I originally supposed,
but as I understand it now, this behaviour is already taken and changing it
would cause backward compatibility issues.

So these are roughly my ideas.
For me it would cover many application cases.

