
Currently, working with strings in numpy is not very convenient. You have to use a separate set of functions in a separate namespace, and those functions are relatively limited and poorly documented. A solution that several other projects, including pandas [0] and xarray [1], have adopted is string accessor methods: a set of methods attached to a `str` attribute of the class. These have the advantage that they are always available and have a well-defined object they operate on. On non-str dtypes, accessing the attribute would raise an exception. This would also provide a standardized set of methods and behaviors that are part of the numpy API and that other classes could depend on. An example would be something like this:
[0] https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-met...
[1] https://xarray.pydata.org/en/stable/generated/xarray.core.accessor_str.Strin...
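For readers unfamiliar with the accessor pattern, here is a minimal sketch of what is being proposed. It is only an illustration: `StrAccessor` is a hypothetical wrapper class, not an existing or planned NumPy API, and it simply delegates to the current `np.char` functions.

```python
import numpy as np


class StrAccessor:
    """Hypothetical accessor (NOT part of NumPy); delegates to np.char."""

    def __init__(self, arr):
        arr = np.asarray(arr)
        # Mirror the proposal: refuse to work on non-string dtypes.
        if arr.dtype.kind not in ("U", "S"):
            raise AttributeError("str accessor requires a string dtype")
        self._arr = arr

    def title(self):
        return np.char.title(self._arr)

    def upper(self):
        return np.char.upper(self._arr)

    def startswith(self, prefix):
        return np.char.startswith(self._arr, prefix)


mystr = np.array(["test first", "test second", "test third"])
titled = StrAccessor(mystr).title()
print(titled)  # ['Test First' 'Test Second' 'Test Third']
```

With a real accessor the call would presumably be spelled `mystr.str.title()`; the explicit wrapper is used here only because `ndarray` has no such attribute.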

They are in np.char:

    mystr = np.array(["test first", "test second", "test third"])
    np.char.title(mystr)
    array(['Test First', 'Test Second', 'Test Third'], dtype='<U11')

-- Sent from: http://numpy-discussion.10968.n7.nabble.com/

On Sat, Mar 6, 2021 at 12:57 PM dan_patterson <dan_patterson@outlook.com> wrote:
I mentioned those in my email, but they are far less convenient to use than class methods, and they do not relate well to how built-in strings are used in Python. That is why other projects have started using accessor methods, and why Python removed all the separate string functions in Python 3. The functions in np.char are also limited in their capabilities and, in my opinion, fairly poorly documented. Some of those limitations are impossible to overcome: for example, as plain functions they can never support operators such as addition or multiplication, or slicing, the way Python strings can, while an accessor could. However, putting them directly on ndarray as top-level methods would pollute its method namespace too much. That is why I am suggesting that numpy do the same thing that pandas, xarray, etc. are doing and put those methods under a 'str' attribute of ndarrays rather than keeping them as separate functions.
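To make the point about operators and slicing concrete, here is a rough sketch. `StrOps` is a hypothetical class invented for this illustration and is not a NumPy API. `np.char.add` and `np.char.multiply` are existing functions; the per-element slicing falls back to a plain Python loop, which is an assumption of this sketch rather than a claim about how NumPy would implement it.

```python
import numpy as np


class StrOps:
    """Hypothetical operator-capable wrapper (NOT part of NumPy)."""

    def __init__(self, arr):
        self._arr = np.asarray(arr)

    def __add__(self, other):
        # Element-wise concatenation, spelled with "+".
        return np.char.add(self._arr, other)

    def __mul__(self, n):
        # Element-wise repetition, spelled with "*".
        return np.char.multiply(self._arr, n)

    def __getitem__(self, key):
        # Slice every string, like s[key] on a built-in str.
        flat = [s[key] for s in self._arr.ravel().tolist()]
        return np.array(flat).reshape(self._arr.shape)


mystr = np.array(["test first", "test second", "test third"])

# Function spelling available today:
print(np.char.add(mystr, "!"))

# Operator and slicing spelling an accessor could offer:
print(StrOps(mystr) + "!")
print(StrOps(mystr) * 2)
print(StrOps(mystr)[:4])  # ['test' 'test' 'test']
```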

I think that any string functions that are exposed from an ndarray would have to be guaranteed to work in-place. Requiring casting to objects to use the methods feels more like syntactic sugar than an essential case. I think most of the ones mentioned are low performance and can't take advantage of the storage as a blob of int8 (ASCII) or int32 (UTF-32) that underlies NumPy string arrays. I also think the existence of these in pandas reduces the case for them being in NumPy.

On Sun, Mar 7, 2021, 05:32 Todd <toddrjen@gmail.com> wrote:
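The storage layout mentioned above can be inspected directly. The following is only a sketch of the idea: it views a fixed-width unicode array as its underlying 32-bit code points and upper-cases ASCII letters in place with integer arithmetic. It assumes ASCII-only content and a contiguous array, and it is not how `np.char.upper` is implemented.

```python
import numpy as np

words = np.array(["abc", "de"])  # fixed-width unicode, dtype '<U3'

# Each element is stored as 3 UTF-32 code points, padded with zeros.
codes = words.view(np.uint32).reshape(words.shape[0], -1)
before = codes.copy()
print(before)  # [[ 97  98  99]
               #  [100 101   0]]

# In-place ASCII upper-casing on the integer view; no new strings are
# allocated, and `words` changes because `codes` shares its memory.
lower = (codes >= ord("a")) & (codes <= ord("z"))
codes[lower] -= 32
print(words)  # ['ABC' 'DE']

# Byte strings ('S' dtype) are stored the same way with one byte per char.
raw = np.array([b"abc"]).view(np.uint8)
print(raw)  # [97 98 99]
```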

On Sun, 2021-03-07 at 09:34 +0000, Kevin Sheppard wrote:
I agree with this, the need seems much lower in NumPy. And NumPy's currently somewhat weird strings, at least for me, make it even less appealing to expose more string utilities of any kind at this time.

In general, there is probably something to be said for such an "accessor", in the sense of having a place to put methods which are specific to the array's dtype. Other examples are datetime/timedelta or Units and probably many potential DTypes [1]. It is one advantage that the `astropy.units.Quantity` subclass has over a DType-based solution: methods can be added very transparently.

Basically: the current `np.char` functions are a bit weird and I would need quite a bit more convincing to expose them at this time. But I would be delighted if we can think of a solution that goes beyond `str` [2]. Probably not adding `ndarray.str` at all, or only if the array has a string DType. But do it in a way that generalizes!

That could be a DType-specific mixin class, or I had previously played with the thought of a "generic" accessor:

    ndarray.elementwise.<ufuncs-provided-by-DType>

But those go beyond the original string request and need some smart ideas/thoughts!

An interesting aside is that `arr.imag` and `arr.real` fall into the same category. But they are narrow enough that we can just have a specific solution for them.

Cheers,

Sebastian

[1] Datetimes/timedelta might have some use for basic timezone handling (not sure if relevant to NumPy's naive datetimes). `astropy.units.Quantity` has a few extra methods/properties:

* `.cgs`, `.si`, `.decompose()`, `.to()`: cast to different unit.
* `.unit`
* `.value`: get a value array view without any unit.
* `.to_value()` method that returns a copy, not a view.

Of course we can spell those using DTypes, but I think it might be long: `arr.astype(arr.dtype.cgs)`, or `arr.view(arr.dtype.unitless)`. Utility functions similar to `np.char` also can simplify all of this, but methods do have merit. Other user DTypes could very well have more compelling use-cases.

[2] But it probably won't reach my serious thinking cycles for a while. For starters, dedicated utility functions seem decent enough...
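As a rough illustration of the "generic" accessor idea, the sketch below dispatches on the array's dtype to find dtype-specific methods. Everything named here (`register_accessor`, `elementwise`, `UnicodeMethods`, `DatetimeMethods`) is hypothetical and invented for this example; none of it exists in NumPy, and a real design would presumably hang off the DType rather than a module-level registry. The `year` helper ignores NaT values.

```python
import numpy as np

_ACCESSORS = {}  # dtype.kind -> accessor class (hypothetical registry)


def register_accessor(kind):
    """Register a class providing methods for one dtype kind."""
    def decorator(cls):
        _ACCESSORS[kind] = cls
        return cls
    return decorator


@register_accessor("U")
class UnicodeMethods:
    def __init__(self, arr):
        self._arr = arr

    def upper(self):
        return np.char.upper(self._arr)


@register_accessor("M")
class DatetimeMethods:
    def __init__(self, arr):
        self._arr = arr

    def year(self):
        # datetime64[Y] counts whole years since 1970.
        return self._arr.astype("datetime64[Y]").astype(int) + 1970


def elementwise(arr):
    """Return the dtype-specific method holder, or raise if none exists."""
    arr = np.asarray(arr)
    try:
        cls = _ACCESSORS[arr.dtype.kind]
    except KeyError:
        raise AttributeError(f"no dtype-specific methods for {arr.dtype}") from None
    return cls(arr)


print(elementwise(np.array(["a", "b"])).upper())                       # ['A' 'B']
print(elementwise(np.array(["2021-03-07"], dtype="datetime64[D]")).year())  # [2021]

# The existing complex-only attributes behave like a narrow special case:
z = np.array([1 + 2j, 3 + 4j])
print(z.real, z.imag)  # [1. 3.] [2. 4.]
```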

participants (4): dan_patterson, Kevin Sheppard, Sebastian Berg, Todd