[Numpy-discussion] Array and string interoperability

Tue Jun 6 18:05:58 EDT 2017

On Mon, Jun 5, 2017 at 3:59 PM, Mikhail V <mikhailwas at gmail.com> wrote:

> -- classify by "forward/backward" conversion:
>     For this time consider only forward, i.e. I copy data from string
> to numpy array
>
> -- classify by " bytes  vs  ordinals ":
>
> a)  bytes:  If I need raw bytes - in this case e.g.
>
>   B = bytes(s.encode())
>

no need to call "bytes" -- encode() returns a bytes object:

In [1]: s = "this is a simple ascii-only string"

In [2]: b = s.encode()

In [3]: type(b)

Out[3]: bytes

In [4]: b

Out[4]: b'this is a simple ascii-only string'

>
> will do it. then I can copy data to array. So currently there are methods
> coverings this. If I understand correctly the data extracted corresponds
> to utf-??  byte feed, i.e. non-constant byte-length of chars (1 up to
> 4 bytes per char for
> the 'wide' unicode, correct me if I am wrong).
>

In [5]: s.encode?
Docstring:
S.encode(encoding='utf-8', errors='strict') -> bytes

So the default is utf-8, but you can set any encoding you want (that python
supports)

 In [6]: s.encode('utf-16')

Out[6]: b'\xff\xfet\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00
\x00s\x00i\x00m\x00p\x00l\x00e\x00
\x00a\x00s\x00c\x00i\x00i\x00-\x00o\x00n\x00l\x00y\x00
\x00s\x00t\x00r\x00i\x00n\x00g\x00'

> b):  I need *ordinals*
>   Yes, I need ordinals, so for the bytes() method, if a Python 3
> string contains only
>   basic ascii, I can so or so convert to bytes then to integer array
> and the length will
>   be the same 1byte for each char.
>   Although syntactically seen, and with slicing, this will look e.g. like:
>
> s= "012 abc"
> B = bytes(s.encode())  # convert to bytes
> k  = len(s)
> arr = np.zeros(k,"u1")   # init empty array length k
> arr[0:2] = list(B[0:2])
> print ("my array: ", arr)
> ->
> my array:  [48 49  0  0  0  0  0]
>

This can be done more cleanly:

In [15]: s= "012 abc"

In [16]: b = s.encode('ascii')

# you want to use the ascii encoding so you don't get utf-8 cruft if there
are non-ascii characters
#  you could use latin-1 too (Or any other one-byte per char encoding

In [17]: arr = np.fromstring(b, np.uint8)
# this is using fromstring() to means it's old py definiton - treat teh
contenst as bytes
# -- it really should be called "frombytes()"
# you could also use:

In [22]: np.frombuffer(b, dtype=np.uint8)
Out[22]: array([48, 49, 50, 32, 97, 98, 99], dtype=uint8)In [18]: print arr

In [19]: print(arr)
[48 49 50 32 97 98 99]

# you got the ordinals

In [20]: "".join([chr(i) for i in arr])
Out[20]: '012 abc'

# yes, they are the right ones...

> Result seems correct. Note that I also need to use list(B), otherwise
> the slicing does not work (fills both values with 1, no idea where 1
> comes from).
>

that is odd -- I can't explain it right now either...

> Or I can write e.g.:
> arr[0:2] = np.fromstring(B[0:2], "u1")
>
> But looks indeed like a 'hack' and not so simple.
>

is the above OK?

> -- classify "what is maximal ordinal value in the string"
> Well, say, I don't know what is maximal ordinal, e.g. here I take
> 3 Cyrillic letters instead of 'abc':
>
> s= "012 АБВ"
> k  = len(s)
> arr = np.zeros(k,"u4")   # init empty 32 bit array length k
> arr[:] = np.fromstring(np.array(s),"u4")
> ->
> [  48   49   50   32 1040 1041 1042]
>

so this is making a numpy string, which is a UCS-4 encoding unicode -- i.e.
4 bytes per charactor. Then you care converting that to an 4-byte unsigned
int. but no need to do it with fromstring:

In [52]: s
Out[52]: '012 АБВ'

In [53]: s_arr.reshape((1,)).view(np.uint32)
Out[53]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

we need the reshape() because .view does not work with array scalars -- not
sure why not?

> This gives correct results indeed. So I get my ordinals as expected.
> So this is better/preferred way, right?
>

I would maybe do it more "directly" -- i.e. use python's string to do the
encoding:

In [64]: s
Out[64]: '012 АБВ'

In [67]: np.fromstring(s.encode('U32'), dtype=np.uint32)
Out[67]: array([65279,    48,    49,    50,    32,  1040,  1041,  1042],
dtype=uint32)

that first value is the byte-order mark (I think...), you  can strip it off
with:

In [68]: np.fromstring(s.encode('U32')[4:], dtype=np.uint32)
Out[68]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

or, probably better simply specify the byte order in the encoding:

In [69]: np.fromstring(s.encode('UTF-32LE'), dtype=np.uint32)
Out[69]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)

arr = np.ordinals(s)
> arr[0:2] = np.ordinals(s[0:2])  # with slicing
>
> or, e.g. in such format:
>
> arr = np.copystr(s)
> arr[0:2] = np.copystr(s[0:2])
>

I don't think any of this is necessary -- the UCS4 (Or UTF-32) "encoding"
is pretty much the ordinals anyway.

As you notices, if you make a numpy unicode string array, and change the
dtype to unsigned int32, you get what you want.

You really don't want to mess with any of this unless you understand
unicode and encodings anyway....

Though it is a bit akward -- why is your actual use-case for working with
ordinals???

BTW, you can use regular python to get the ordinals first:

In [71]: np.array([ord(c) for c in s])
Out[71]: array([  48,   49,   50,   32, 1040, 1041, 1042])

Though for Python 2 could raise questions why need casting to "u4".
>

this would all work the same with python 2 if you used unicode objects
instead of strings. Maybe good to put:

from __future__ import unicode_literals

in your source....

> So approximately are my ideas.
> For me it would cover many application cases.

I'm still curious as to your use-cases -- when do you have a bunch of
ordinal values??

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20170606/d0f16d75/attachment.html>