[Numpy-discussion] Array and string interoperability

Mikhail V mikhailwas at gmail.com
Mon Jun 5 19:06:34 EDT 2017

On 5 June 2017 at 19:40, Chris Barker <chris.barker at noaa.gov> wrote:
>> > Python3 assumes 4-byte strings but in reality most of the time
>> > we deal with 1-byte strings, so there is huge waste of resources
>> > when dealing with 4-bytes. For many serious projects it is just not
>> > needed.
>> That's quite enough anglo-centrism, thank you. For when you need byte
>> strings, Python 3 has a type for that. For when your strings contain
>> text, bytes with no information on encoding are not enough.
> There was a big thread about this recently -- it seems to have not quite
> come to a conclusion.

I have started to read that thread, though I got lost in the transitions
between ideas. Most likely it was about some new string array type...

> But anglo-centrism aside, there is substantial demand
> for a "smaller" way to store mostly-ascii text.

Obviously there is demand. The terror of Unicode touches many aspects
of a programmer's life. It is not Numpy's problem, though.
Realistically, satisfying this demand is a hard and wide problem.
Foremost, it comes down to the question of defining this "optimal
8-bit character table".
And "Latin-1" (exactly as it is) is not that optimal table, if only
because of its huge number of accented letters. But, granted, if we
define most accented letters as "optional", i.e. delete them, then it is
quite a reasonable basic char table to start with.
Then comes the question of popularizing the new table (which doesn't
even exist yet).

>> > There can be some convenience methods for ascii operations,
>> > like eg char.toupper(), but currently they don't seem to work with
>> > integer arrays so why not make those potentially useful methods usable
>> > and make them work on normal integer arrays?
>> I don't know what you're doing, but I don't think numpy is normally the
>> right tool for text manipulation...
> I agree here. But if one were to add such a thing (vectorized string
> operations) -- I'd think the thing to do would be to wrap (or port) the
> python string methods. But it should only work for actual string dtypes, of
> course.
> note that another part of the discussion previously suggested that we have a
> dtype that wraps a native python string object -- then you'd get all for
> free. This is essentially an object array with strings in it, which you can
> do now.

Well, here I must admit I don't quite understand the whole idea of a
"numpy array of string type". How is it used? What is the main benefit/feature...?

Examples of integer array usage for textual data in my case:
- holding data in a text editor (mutability+indexing/slicing)
- filtering, transformations (e.g. table translations, cryptography, etc.)
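
To make the two use cases above concrete, here is a minimal sketch of what I mean by holding text in an integer array. It assumes plain ASCII input (one byte per character); the helper name to_text is just made up for the illustration, not something Numpy provides.

```python
import numpy as np

# Assumption: the text is plain ASCII, so each character is one uint8 code.
text = "012 abc"
codes = np.frombuffer(text.encode("ascii"), dtype=np.uint8).copy()  # copy -> writable

# Mutability + indexing/slicing work as for any integer array.
codes[0] = ord("9")

# "Rolling" the text is just np.roll on the codes.
rolled = np.roll(codes, 2)

# Table translation: a 256-entry lookup table indexed by character code.
table = np.arange(256, dtype=np.uint8)
lower = np.arange(ord("a"), ord("z") + 1)
table[lower] = lower - 32  # map a-z onto A-Z
upper = table[codes]


def to_text(a):
    """Decode an array of ASCII codes back into a str (hypothetical helper)."""
    return a.astype(np.uint8).tobytes().decode("ascii")


print(to_text(codes))
print(to_text(rolled))
print(to_text(upper))
```

The lookup-table step is the same idea as str.translate, only applied to the whole array at once by fancy indexing.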

A string-type array? Is this the string array you describe:

import numpy as np

s = "012 abc"
arr = np.array(s)
print("type ", arr.dtype)
print("shape ", arr.shape)
print("my array: ", arr)
arr = np.roll(arr, 2)  # arr is 0-d, so arr[0] would raise IndexError
print("my array: ", arr)

Output:

type  <U7
shape  ()
my array:  012 abc
my array:  012 abc

So what does it do? What's up with the shape?
E.g. here I wanted to 'roll' the string.
How would I replace chars? Or delete them?
What is the general idea behind it?
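
For comparison, a rough sketch of the behaviour I would have expected. It only works because the string is first split into one-character elements with list(s), which is my own workaround and not necessarily the intended way to use string arrays.

```python
import numpy as np

s = "012 abc"

# Splitting the str into characters gives a 1-d array with one element per char.
chars = np.array(list(s))
print(chars.shape, chars.dtype)

# Roll the characters by two positions.
rolled = np.roll(chars, 2)

# Replace every "a" with "X" using a boolean mask.
replaced = chars.copy()
replaced[chars == "a"] = "X"

# Delete the character at index 3 (the space).
deleted = np.delete(chars, 3)

print("".join(rolled))
print("".join(replaced))
print("".join(deleted))
```

Here the shape is (7,) rather than (), so roll, replace and delete act per character, at the cost of converting to and from a Python str by hand.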

