[Numpy-discussion] Array and string interoperability

Tue Jun 6 18:29:00 EDT 2017

On Mon, Jun 5, 2017 at 4:06 PM, Mikhail V <mikhailwas at gmail.com> wrote:

> Likely it was about some new string array type...

yes, it was.

> Obviously there is demand. Terror of unicode touches many aspects
> of programmers life.

I don't know that I'd call it Terror, but frankly, the fact that you need
up to 4 bytes for a single character is really not the big issue. Given
that computer memory has grown by literally orders of magnitude since
Unicode was introduced, I don't know why there is such a hang-up about it.
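
To put numbers on the "up to 4 bytes" point, here is a small sketch. The sample characters are my own picks for illustration, not anything from the thread:

```python
import numpy as np

# UTF-8 is variable width: 1 to 4 bytes per code point.
samples = ["a", "\u00e9", "\u0411", "\u20ac", "\U0001F600"]
utf8_sizes = {ch: len(ch.encode("utf-8")) for ch in samples}
print(utf8_sizes)

# numpy's unicode dtype is fixed width: 4 bytes for every character,
# so a 7-character string costs 28 bytes per array element.
arr = np.array(["012 \u0410\u0411\u0412"])
print(arr.dtype.itemsize)  # 28
```

So the cost in numpy is a flat 4 bytes per character, whatever the text contains.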

But we're scientific programmers, and we like to be efficient!

> Foremost, it comes down to the question of defining this "optimal
> 8-bit character table".
> And "Latin-1", (exactly as it is)  is not that optimal table,

there is no such thing as a single "optimal" set of characters when you are
limited to 256 of them...

latin-1 is pretty darn good for the, well, latin-based languages....
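
A quick sketch of what that means in practice. The helper name fits_latin1 and the sample strings are made up for this example:

```python
text_fr = "d\u00e9j\u00e0 vu"     # accented Latin letters
text_ru = "\u0410\u0411\u0412"    # Cyrillic

# latin-1 stores exactly one byte per character.
latin1_bytes = text_fr.encode("latin-1")
print(len(text_fr), len(latin1_bytes))  # 7 7


def fits_latin1(text):
    """Return True if every character in text has a latin-1 code."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1(text_fr))  # True
print(fits_latin1(text_ru))  # False
```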

> But, granted, if define most accented letters as
> "optional", i.e. delete them
> then it is quite reasonable basic char table to start with.

Then you are down to ASCII, no?

but anyway, I don't think a new encoding is really the topic at hand
here....

> >> I don't know what you're doing, but I don't think numpy is normally the
> >> right tool for text manipulation...
> >
> > I agree here. But if one were to add such a thing (vectorized string
> > operations) -- I'd think the thing to do would be to wrap (or port) the
> > python string methods. But it should only work for actual string dtypes,
> > of course.
> >
> > note that another part of the discussion previously suggested that we
> > have a dtype that wraps a native python string object -- then you'd get
> > all for free. This is essentially an object array with strings in it,
> > which you can do now.
>
> Well here I must admit I don't quite understand the whole idea of
> "numpy array of string type". How used? What is main benefit/feature...?

here you go -- you can do this now (with s = '012 АБВ'):

In [74]: s_arr = np.array([s, "another string"], dtype=object)

In [75]: s_arr
Out[75]: array(['012 АБВ', 'another string'], dtype=object)

In [76]: s_arr.shape
Out[76]: (2,)

You now have an array with python string objects in it -- thus access to all
the string functionality:

In [81]: s_arr[1] = s_arr[1].upper()
In [82]: s_arr
Out[82]: array(['012 АБВ', 'ANOTHER STRING'], dtype=object)

and the ability to have each string be a different length.

If numpy were to know that those were string objects, rather than arbitrary
python objects, it could do vectorized operations on them, etc.

You can do that now with numpy.vectorize, but it's pretty klunky.

In [87]: np_upper = np.vectorize(str.upper)
In [88]: np_upper(s_arr)

Out[88]:
array(['012 АБВ', 'ANOTHER STRING'],
      dtype='<U14')
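
For comparison, here is a sketch of the same thing using the np.char functions, which wrap the python string methods for arrays that have an actual string dtype. The astype(str) conversion is my addition, assuming you are willing to give up the object array for a fixed-width one:

```python
import numpy as np

s_arr = np.array(["012 \u0410\u0411\u0412", "another string"], dtype=object)

# The np.char functions want a real string dtype, so convert the
# object array to a fixed-width unicode array first.
u_arr = s_arr.astype(str)
print(u_arr.dtype.kind, u_arr.dtype.itemsize // 4)  # U 14

# Element-wise str.upper over the whole array.
upper = np.char.upper(u_arr)
print(upper)
```

The trade-off is that every element is padded to the longest string, which is exactly what the object array avoids.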

> Example integer array usage in context of textual data in my case:
> - holding data in a text editor (mutability+indexing/slicing)
>

you really want to use regular old python data structures for that...

> - filtering, transformations (e.g. table translations, cryptography, etc.)
>

that may be something to do with ordinals and numpy -- but then you need to
work with ascii or latin-1 and uint8 dtypes, or full Unicode and uint32
dtype -- that's that.
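
A rough sketch of both routes. The uppercase lookup table is only an illustration of a "table translation", and all the variable names are mine:

```python
import numpy as np

s = "012 abc"

# ASCII / latin-1 text: one byte per character, so uint8 holds the ordinals.
codes8 = np.frombuffer(s.encode("latin-1"), dtype=np.uint8)

# Translation table: identity, except lowercase ASCII maps to uppercase.
table = np.arange(256, dtype=np.uint8)
table[ord("a"):ord("z") + 1] -= 32
translated = table[codes8].tobytes().decode("latin-1")
print(translated)  # 012 ABC

# Full Unicode: encode as UTF-32 (little endian, no BOM) and view as uint32.
t = "012 \u0410\u0411\u0412"
codes32 = np.frombuffer(t.encode("utf-32-le"), dtype="<u4")
rolled = np.roll(codes32, 3).tobytes().decode("utf-32-le")
print(rolled)
```

Note that np.frombuffer gives a read-only array over the bytes, so copy it first if you want to modify the ordinals in place.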

> String type array? Will this be a string array you describe:
>
> s= "012 abc"
> arr = np.array(s)
> print ("type ", arr.dtype)
> print ("shape ", arr.shape)
> print ("my array: ", arr)
> arr = np.roll(arr[0],2)
> print ("my array: ", arr)
> ->
> type  <U7
> shape  ()
> my array:  012 abc
> my array:  012 abc
>
>
> So what it does? What's up with shape?
>

shape is an empty tuple, meaning this is a zero-dimensional array
(effectively a numpy scalar) containing a single string

type '<U7' means little endian, unicode, 7 characters
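
You can poke at those properties directly. A minimal sketch:

```python
import numpy as np

arr = np.array("012 abc")

print(arr.shape)           # ()
print(arr.ndim)            # 0
print(arr.dtype.kind)      # U
print(arr.dtype.itemsize)  # 28, i.e. 7 characters * 4 bytes each

# Pull the single element back out as a plain python str.
value = arr.item()
print(type(value).__name__, repr(value))
```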

> e.g. here I wanted to 'roll' the string.
> How would I replace chars? or delete?
> What is the general idea behind?
>

the numpy string type (unicode type) works with fixed length strings -- not
characters, but you can reshape it and make a view (here arr = np.array(s)):

In [89]: s= "012 abc"

In [90]: arr.shape = (1,)

In [91]: arr.shape
Out[91]: (1,)

In [93]: c_arr = arr.view(dtype = '<U1')

In [97]: np.roll(c_arr, 3)
Out[97]:
array(['a', 'b', 'c', '0', '1', '2', ' '],
      dtype='<U1')

You could also create it as a character array in the first place by
unpacking it into a list first:

In [98]: c_arr = np.array(list(s))

In [99]: c_arr
Out[99]:
array(['0', '1', '2', ' ', 'a', 'b', 'c'],
      dtype='<U1')
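
As for replacing and deleting characters: once you have a character array, the ordinary array tools apply. This is only a sketch of one way to do it, and the variable names are mine:

```python
import numpy as np

s = "012 abc"
c_arr = np.array(list(s))

# Replace: boolean-mask assignment swaps every space for an underscore.
replaced = c_arr.copy()
replaced[replaced == " "] = "_"
print("".join(replaced))  # 012_abc

# Delete: np.delete returns a new array without the element at index 3.
deleted = np.delete(c_arr, 3)
print("".join(deleted))   # 012abc

# Roll, then join the characters back into a python string.
rolled = "".join(np.roll(c_arr, 2))
print(rolled)             # bc012 a
```

np.delete has to build a new array each time, which is part of why plain python structures are the better fit for something like a text editor buffer.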

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov