[Numpy-discussion] Array and string interoperability

Mikhail V mikhailwas at gmail.com
Tue Jun 6 23:47:17 EDT 2017


On 7 June 2017 at 00:05, Chris Barker <chris.barker at noaa.gov> wrote:
> On Mon, Jun 5, 2017 at 3:59 PM, Mikhail V <mikhailwas at gmail.com> wrote:

>> s= "012 abc"
>> B = bytes(s.encode())  # convert to bytes
>> k  = len(s)
>> arr = np.zeros(k,"u1")   # init empty array length k
>> arr[0:2] = list(B[0:2])
>> print ("my array: ", arr)
>> ->
>> my array:  [48 49  0  0  0  0  0]
>
>
> This can be done more cleanly:
>
> In [15]: s= "012 abc"
>
> In [16]: b = s.encode('ascii')
>
> # you want to use the ascii encoding so you don't get utf-8 cruft if there
> are non-ascii characters
> #  you could use latin-1 too (or any other one-byte-per-char encoding)

Thanks for clarifying, that makes sense.
Also it's a good way to validate the string.
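
A minimal sketch of the validation idea, assuming a hypothetical helper
named to_ascii_codes (it is not part of NumPy). Encoding with 'ascii'
raises UnicodeEncodeError for any character outside the 7-bit range, so
the conversion doubles as a check:

```python
import numpy as np

def to_ascii_codes(s):
    """Hypothetical helper: return the ASCII codes of s as a uint8 array.

    str.encode('ascii') raises UnicodeEncodeError if s contains any
    non-ASCII character, so the conversion also validates the string.
    """
    b = s.encode("ascii")
    # frombuffer gives a read-only view of the bytes; copy to make it writable
    return np.frombuffer(b, dtype=np.uint8).copy()

codes = to_ascii_codes("012 abc")

try:
    to_ascii_codes("012 \u0410\u0411\u0412")   # Cyrillic letters are not ASCII
    valid = True
except UnicodeEncodeError:
    valid = False
```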


>
> or, probably better simply specify the byte order in the encoding:
>
> In [69]: np.fromstring(s.encode('UTF-32LE'), dtype=np.uint32)
> Out[69]: array([  48,   49,   50,   32, 1040, 1041, 1042], dtype=uint32)


Ok, this gives what I want too.
So now for Unicode I have two possible options (apart from a possible
"fromstring" spelling),
with indexing (if I want to copy into an already existing array on the fly):

arr[0:3] = np.fromstring(np.array(s[0:3]),"u4")
arr[0:3] = np.fromstring(s[0:3].encode('UTF-32LE'),"u4")
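
For reference, later NumPy releases deprecate np.fromstring for binary
data in favor of np.frombuffer. A sketch of the same two options written
that way follows. The second one uses a one-element array and .view()
rather than passing the array to fromstring, which is my substitution and
not the spelling used above:

```python
import numpy as np

s = "012 abc"
arr = np.zeros(len(s), dtype=np.uint32)   # fixed-size buffer of code points

# Option 1: encode to UTF-32 with an explicit byte order (no BOM is written),
# then reinterpret the bytes as little-endian 32-bit unsigned integers.
arr[0:3] = np.frombuffer(s[0:3].encode("utf-32-le"), dtype="<u4")

# Option 2: build a one-element unicode array and view its memory as uint32
# (NumPy stores unicode strings as UCS4 in native byte order).
arr[4:7] = np.array([s[4:7]]).view(np.uint32)

print("my array: ", arr)
```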


>
>> arr = np.ordinals(s)
>> arr[0:2] = np.ordinals(s[0:2])  # with slicing
>
>
> I don't think any of this is necessary -- the UCS4 (Or UTF-32) "encoding" is
> pretty much the ordinals anyway.
>
> As you noticed, if you make a numpy unicode string array, and change the
> dtype to unsigned int32, you get what you want.

No, I am not implying anything is necessary; it just seems to be sort of a
pattern.
And from the Python 3 perspective, where string indexing is by wide
characters ... well, I don't know.


>> Example integer array usage in context of textual data in my case:
>> - holding data in a text editor (mutability+indexing/slicing)
>>

>you really want to use regular old python data structures for that...
>[...]
>the numpy string type (unicode type) works with fixed length strings -- not
>characters, but you can reshape it and make a view:
>[...]

I am intentionally choosing a fixed-size array for holding the data and
writing values by index.
But wait a moment, characters *are* integers, identities, [put some
other name here].

> In [93]: c_arr = arr.view(dtype = '<U1')
> In [97]: np.roll(c_arr, 3)
> Out[97]:
> array(['a', 'b', 'c', '0', '1', '2', ' '],
>     dtype='<U1')

So here it prints ['a', 'b', 'c', '0', '1', '2', ' '], which
is the same data; it is just a matter of how it is printed.
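
A small sketch of that point, assuming the array is built from ord()
values (the variable names here are only for illustration). The integer
array and the '<U1'-style character view share one block of memory and
differ only in how the elements are interpreted:

```python
import numpy as np

s = "012 abc"
codes = np.array([ord(ch) for ch in s], dtype=np.uint32)

# Same memory, two interpretations: 32-bit integers or 1-character strings.
# "U1" uses the native byte order, matching np.uint32.
chars = codes.view("U1")

rolled = np.roll(chars, 3)              # operates on the character view
rolled_codes = rolled.view(np.uint32)   # back to integers, no conversion

print(rolled)
print(rolled_codes)
```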

If we talk about methods already available in particular libraries, then
yes, they are set up to work on specific object types only.
But generally speaking, if I want to select e.g. specific character values,
or I am selecting specific values in some discrete sets...
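
As an illustration of that kind of selection on an integer array (a
sketch only; the digit range and the vowel set are arbitrary examples I
picked, not something from the thread):

```python
import numpy as np

text = "012 abc"
codes = np.frombuffer(text.encode("utf-32-le"), dtype="<u4")

# Boolean mask of positions holding ASCII digits '0'..'9'
is_digit = (codes >= ord("0")) & (codes <= ord("9"))
digit_positions = np.flatnonzero(is_digit)

# Membership in an arbitrary discrete set of code points
vowels = np.array([ord(c) for c in "aeiou"], dtype=codes.dtype)
is_vowel = np.isin(codes, vowels)
```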

But I have no experience with numpy string types
and have not yet grasped their real purpose.



-------
(Off topic here)


>> Foremost, it comes down to the question of defining this "optimal
>> 8-bit character table".
>> And "Latin-1" (exactly as it is) is not that optimal table,
>
>there is no such thing as a single "optimal" set of characters when you are
>limited to 255 of them...

Yeah, depends much on criteria of 'optimality' and many other things ;)

>> But, granted, if we define most accented letters as
>> "optional", i.e. delete them,
>> then it is quite a reasonable basic char table to start with.
>
>Then you are down to ASCII, no?

No, then I am down to ASCII plus a few vital characters, e.g.:

- Dashes (which could solve the painful, age-old problem of
"hyphen" vs "minus")
- Multiplication sign, degree
- Em dash, quotation marks, spaces (non-breaking, half)   --  all
vital for typesetting
...

If you think about it, 255 units is more than enough to define
perfect communication standards.

>but anyway, I don't think a new encoding is really the topic at hand
>here....

Yes, I think this is off-topic on this list. But it is an interesting
question where it would be on-topic.
It seems like those encodings come from some "mysterious castle in
the clouds".


Mikhail

