
On Thu, Oct 29, 2020 at 6:09 PM Sebastian Berg <sebastian@sipsolutions.net> wrote:
On Tue, 2020-10-27 at 17:15 -0600, Aaron Meurer wrote:
For ndindex (https://quansight.github.io/ndindex/), the biggest issue with the API is that to use an ndindex object to actually index an array, you have to use a[idx.raw] instead of a[idx]. This is because NumPy arrays do not accept custom objects as indices. The exception is objects that define __index__, but this only works for integer indices. If __index__ returns anything other than an integer, you get an IndexError. This is annoying because it's easy to forget the .raw when working with the ndindex API, and the error message from NumPy isn't informative about what went wrong unless you know to expect it.
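The integer-only nature of __index__ can be seen without NumPy. The following is a minimal sketch using a plain Python list and operator.index; the class names Three and BadIndex are made up for illustration. Built-in sequences raise TypeError in this situation, whereas the message above reports an IndexError from NumPy.

```python
import operator


class Three:
    """Hypothetical object that stands for the integer 3."""

    def __index__(self):
        return 3


class BadIndex:
    """Hypothetical object whose __index__ returns a slice, not an int."""

    def __index__(self):
        return slice(0, 2)


data = [10, 20, 30, 40, 50]

# An integer-valued __index__ works both as an index and as a slice element.
item = data[Three()]
tail = data[Three():]

# A non-integer return value is rejected by Python itself.
try:
    operator.index(BadIndex())
    rejected = False
except TypeError:
    rejected = True
```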
I'd like to propose an API that would allow custom objects to define how they should be converted to a standard NumPy index, similar to __index__ but that supports all index types. I think there are two options here:
- Allow __index__ to return any index type, not just integers. This is the simplest because it reuses an existing API, and __index__ is the best possible name for this API. However, I'm not certain, but this may conflict with the text of PEP 357 (https://www.python.org/dev/peps/pep-0357/). Also, some other APIs use __index__ to check whether something is an integer usable as an index, and those wouldn't accept a generic index. For example, elements of a slice can be any object that defines __index__.
__index__ converts to an integer (safely). There is an assumption that the integer is good for indexing, but I think the name shouldn't be taken to mean it is specific to indexing (even if that was the main motivation).
- Add a new __numpy_index__ API that works like
def __numpy_index__(self):
    return <tuple, integer, slice, newaxis, ellipsis, or integer or boolean array>
In NumPy, __getitem__ and __setitem__ on ndarray would first check if the input index type is one of the known types as it currently does, then it would try __index__, and if neither of those succeeds, it would call index.__numpy_index__() and use the result.
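The lookup order described above could be sketched roughly as follows. This is pure Python, not NumPy code: ToyArray is a hypothetical one-dimensional stand-in for ndarray backed by a list, EveryOther is a made-up index object, and __numpy_index__ is the name proposed in this thread rather than an existing protocol.

```python
import operator


class ToyArray:
    """Hypothetical 1-D stand-in for ndarray, backed by a list."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, index):
        # 1. Known index types are handled first, as today.
        if isinstance(index, (int, slice)):
            return self._values[index]
        # 2. Next, try the existing integer protocol.
        try:
            return self._values[operator.index(index)]
        except TypeError:
            pass
        # 3. Only now consult the proposed hook (name taken from this thread).
        hook = getattr(type(index), "__numpy_index__", None)
        if hook is None:
            raise IndexError(f"unsupported index type: {type(index).__name__}")
        return self[hook(index)]


class EveryOther:
    """Hypothetical index object that stands for the slice ::2."""

    def __numpy_index__(self):
        return slice(None, None, 2)


arr = ToyArray([0, 1, 2, 3, 4])
evens = arr[EveryOther()]
```

In this sketch the hook is reached only on the path that would otherwise end in an IndexError, so existing index types take the same route as before.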
Do you anticipate just:
arr[index]
or also:
arr[index1, index2]
I think both should work. If the second one doesn't work it would be surprising.
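One way to make both spellings work would be to convert each element of a tuple index separately. The sketch below assumes that approach; to_raw_index, FirstTwo and Last are hypothetical names, and __numpy_index__ is the hook proposed in this thread.

```python
def to_raw_index(index):
    """Convert an index, or each element of a tuple index, to built-in form."""
    if isinstance(index, tuple):
        return tuple(to_raw_index(element) for element in index)
    hook = getattr(type(index), "__numpy_index__", None)
    if hook is not None:
        # The hook may itself return a tuple, so normalize its result too.
        return to_raw_index(hook(index))
    return index


class FirstTwo:
    """Hypothetical index object that stands for the slice 0:2."""

    def __numpy_index__(self):
        return slice(0, 2)


class Last:
    """Hypothetical index object that stands for the integer -1."""

    def __numpy_index__(self):
        return -1


single = to_raw_index(FirstTwo())             # arr[index]
mixed = to_raw_index((FirstTwo(), Last(), 0))  # arr[index1, index2, 0]
```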
Would you expect pandas or array-like objects to support this as well?
Yes, it would probably be best for array-likes to also work with the same API. I don't know much about pandas. It seems like it already allows a lot of indexing stuff. Do Series/DataFrame already have such an API?
If we only do `arr[index]` might subclassing tuple be sufficient?
I guess that technically works, except now your objects have to act like a tuple, even if they represent something like a slice (Python does not allow subclassing slice). For ndindex I've tried to make a distinction between objects as representing indices and the built-in objects that happen to be used to represent those indices by default. So an ndindex.Tuple explicitly doesn't work like a tuple, an ndindex.Integer doesn't work like an int, and so on. That way there is a clear distinction between ndindex operations and operations on the built-in types.
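Both halves of that point can be checked in plain Python. In this small sketch, MySlice and PairIndex are hypothetical names: the first shows that slice cannot be used as a base class, the second that a tuple subclass inherits every tuple operation whether the index wants them or not.

```python
# slice is not an acceptable base type, so this class statement fails.
try:
    class MySlice(slice):
        pass
    slice_subclassable = True
except TypeError:
    slice_subclassable = False


class PairIndex(tuple):
    """Hypothetical tuple subclass; it inherits all tuple behaviour."""


idx = PairIndex((slice(0, 2), 1))

# Tuple operations such as len() and concatenation come along for free.
length = len(idx)
extended = idx + (3,)
```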
Do you have any thought on how this might play out with a potential `arr.oindex[...]`?
I think oindex[idx] would call the same API on idx. I'm not sure if it matters that it's oindex, since that's at a higher level.
Adding either to NumPy is probably fairly straightforward, although I would prefer either not to slow down every single indexing operation for an extremely niche use-case (which is likely possible) or to have timings showing that the slowdown is insignificant.
I'm not sure it would. The current cases would all be tried first. The only time the new protocol would be used is when the index type isn't one of the currently allowed types, which currently raises IndexError.
What might help me is understanding `ndindex` itself better, since this seems like asking to add a protocol that may very well be used by only this one project.
That's fair. Maybe the more general API would make more sense then? I think it would need more thinking out, but it would allow a lot more use-cases.
Aaron Meurer
Note: there is a more general way that NumPy arrays could allow __getitem__ to be defined on custom objects, which I am NOT proposing. Instead of an API that returns one of the current predefined index types (tuple, integer, slice, newaxis, ellipsis, or integer or boolean array), there could instead be an API that takes the array as input and returns another array (or view) as an output. This would allow an object to define itself as an index in arbitrary ways, even if such an index would not actually be possible via traditional indexing. There are definitely some interesting ideas that could be done with this, but this idea would be much more complicated, and isn't something that I need. Unless the community feels that a more general API like this would be preferred, I would suggest deferring something like it to a later discussion.
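For comparison only, the more general idea could look something like the sketch below, in which the index object receives the array and returns the result itself. Everything here is hypothetical: ToyArray is a list-backed stand-in for ndarray, and the hook name __apply_to_array__ is invented for this illustration and is not a name suggested in the thread.

```python
class ToyArray:
    """Hypothetical 1-D stand-in for ndarray, backed by a list."""

    def __init__(self, values):
        self.values = list(values)

    def __getitem__(self, index):
        if isinstance(index, int):
            return self.values[index]
        if isinstance(index, slice):
            return ToyArray(self.values[index])
        # Hypothetical hook: the index object receives the array itself.
        hook = getattr(type(index), "__apply_to_array__", None)
        if hook is None:
            raise IndexError(f"unsupported index type: {type(index).__name__}")
        return hook(index, self)


class GreaterThan:
    """Hypothetical index that selects by value, so it must see the data."""

    def __init__(self, threshold):
        self.threshold = threshold

    def __apply_to_array__(self, array):
        return ToyArray(v for v in array.values if v > self.threshold)


selected = ToyArray([5, 1, 7, 3])[GreaterThan(4)]
```

An index like GreaterThan cannot be converted to a fixed slice or integer ahead of time, because the selection depends on the array's contents. That is the extra generality, and the extra complexity, of this variant.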
What would be the best way to go about getting something like this implemented? Is it simple enough that we can just work out the details here and on a pull request, or should I write a NEP?
A short NEP may make sense, at least if this is supposed to be a generic protocol for general array-likes, which I guess it would have to be ready for.
Cheers,
Sebastian
Aaron Meurer _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion