[Numpy-discussion] Array and string interoperability

Mikhail V mikhailwas at gmail.com
Sun Jun 4 14:04:15 EDT 2017


Array and string interoperability

Just sharing my thoughts and a few ideas about simplifying the casting of
strings to arrays.
In the examples, assume NumPy is in the namespace (from numpy import *).

Initializing an array from a string currently looks like this:

s= "012 abc"
A= fromstring(s,"u1")
print A ->
[48 49 50 32 97 98 99]

Perfect.
Now, writing values does not work as (IMO) it should; consider this example:

B= zeros(7,"u1")
B[0]=s[1]
print B ->
[1 0 0 0 0 0 0]

Ugh? It parses the character s[1], i.e. "1", as an integer and writes 1 to B[0].
The first thing I would expect here is a ValueError; I would never expect it to
do such high-level parsing. IMO it would ideally do the following instead:

B[0]=s[1]
print B ->
[49  0  0  0  0  0  0]

So it should just write ord(s[1]) to B[0].
Does that sound logical? To me, very much so.
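
For comparison, the closest spelling of that behaviour in current NumPy is an
explicit ord() call (just a sketch of the workaround, not a proposed API):

B = zeros(7, "u1")
B[0] = ord(s[1])     # write the code point 49 explicitly
print(B)             # -> [49  0  0  0  0  0  0]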
Further, one could write like this:

B[:] = s
print B->
[48 49 50 32 97 98 99]

Namely, cast the string into a byte array. IMO this would be the logical,
expected behavior.
Currently it just throws a ValueError if the string contains non-digits,
so IMO the current casting is hardly of practical use.
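
With current NumPy the whole-string assignment has to go through an explicit
byte view; for example (assuming a Python 2 str, or an explicit encode() on
Python 3):

B = zeros(7, "u1")
B[:] = fromstring(s, "u1")                          # Python 2 str
# on Python 3, encode to bytes first and view them as uint8:
B[:] = frombuffer(s.encode("latin-1"), dtype="u1")
print(B)                                            # -> [48 49 50 32 97 98 99]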

Furthermore, I think this code:

A= array(s,"u1")

Could act exactly same as:

A= fromstring(s,"u1")

But this is just a side idea for simplicity/generality of spelling;
not really necessary.

Further thoughts:
When trying to create a "u1" array from a Python 3 string, the question is
whether it should throw an error. I think yes; in that case the "u4" type
should be specified explicitly at initialisation, I suppose. And translation
from Unicode to extended ASCII (Latin-1) or whatever should be done on the
Python side, or with an explicit translation.

Python 3 assumes 4-byte characters, but in reality we deal with 1-byte
characters most of the time, so there is a huge waste of resources when
working with 4 bytes per character. For many serious projects it is simply
not needed.
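
Both routes can already be expressed with an explicit encode() step, which also
illustrates the size difference; a rough sketch (the codec choice is of course
up to the programmer):

u = "abc"                                            # a Python 3 str
A1 = frombuffer(u.encode("latin-1"), dtype="u1")     # 1 byte per character
A4 = frombuffer(u.encode("utf-32-le"), dtype="<u4")  # 4 bytes per character
print(A1)    # -> [97 98 99]
print(A4)    # -> [97 98 99]
# encode("latin-1") raises UnicodeEncodeError for code points above U+00FF,
# which gives the explicit error behaviour suggested above.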

Furthermore, I think it should be possible to use some of the methods from the
"chararray" submodule directly on normal integer arrays, without conversion to
other array types. So I personally don't really see the need for an additional
chararray type: it's all numbers anyway, and it's up to the programmer to
decide what size of translation tables / value ranges to use.

There could be some convenience methods for ASCII operations, e.g.
char.upper(), but currently they don't seem to work on integer arrays, so why
not make those potentially useful methods work on normal integer arrays?
Or they could even be migrated to the root namespace, e.g. introducing
names with prefixes:

A=ascii_toupper(A)
A=ascii_tolower(A)
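
A minimal sketch of what such a function could look like on a plain uint8
array (ascii_toupper is the hypothetical name from above, and this handles
only the ASCII range):

def ascii_toupper(A):
    # map bytes 97..122 ('a'..'z') to 65..90 ('A'..'Z'), leave the rest alone
    lower = (A >= 97) & (A <= 122)
    return where(lower, A - 32, A)

A = frombuffer(b"012 abc", dtype="u1")
print(ascii_toupper(A))    # -> [48 49 50 32 65 66 67]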

Many things can be achieved with general numeric methods, e.g. translating or
reducing the array. Here I obviously mean fixed-size arrays, not dynamic
ones. How to deal with dynamically changing array sizes is another problem,
and it depends on how the software is designed in the first place and what it
does with the data.
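
For instance, an arbitrary byte-to-byte translation can already be done with
plain indexing into a 256-entry lookup table (a generic sketch, again for
fixed-size uint8 data only):

table = arange(256, dtype="u1")      # identity table
table[97:123] -= 32                  # fold 'a'..'z' onto 'A'..'Z'
A = frombuffer(b"012 abc", dtype="u1")
print(table[A])                      # -> [48 49 50 32 65 66 67]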

For my own text-editing software project I consider only fixed-size,
pre-allocated 1D and 2D "uint8" arrays. I also experiment with my own
encodings, so just as a side note: I don't think much about encoding should be
assumed when creating new array types; it is up to the programmer to decide
what 'meanings' the bytes have.


Kind regards,
Mikhail

