ufunc improvements [Was: Warnings in numpy.ma.test()]

On Wed, Mar 17, 2010 at 10:16 PM, Charles R Harris charlesr.harris@gmail.com wrote:
On Wed, Mar 17, 2010 at 7:39 PM, Darren Dale dsdale24@gmail.com wrote:
On Wed, Mar 17, 2010 at 8:22 PM, Charles R Harris charlesr.harris@gmail.com wrote:
On Wed, Mar 17, 2010 at 5:26 PM, Darren Dale dsdale24@gmail.com wrote:
On Wed, Mar 17, 2010 at 5:43 PM, Charles R Harris charlesr.harris@gmail.com wrote:
On Wed, Mar 17, 2010 at 3:13 PM, Darren Dale dsdale24@gmail.com wrote:
On Wed, Mar 17, 2010 at 4:48 PM, Pierre GM pgmdevlist@gmail.com wrote:
> On Mar 17, 2010, at 8:19 AM, Darren Dale wrote:
>> I started thinking about a third method called __input_prepare__ that
>> would be called on the way into the ufunc, which would allow you to
>> intercept the input and pass a somehow modified copy back to the
>> ufunc. The total flow would be:
>>
>> 1) Call myufunc(x, y[, z])
>> 2) myufunc calls ?.__input_prepare__(myufunc, x, y), which returns
>> x', y' (or simply passes through x, y by default)
>> 3) myufunc creates the output array z (if not specified) and calls
>> ?.__array_prepare__(z, (myufunc, x, y, ...))
>> 4) myufunc finally gets around to performing the calculation
>> 5) myufunc calls ?.__array_wrap__(z, (myufunc, x, y, ...)) and
>> returns the result to the caller
>>
>> Is this general enough for your use case? I haven't tried to think
>> about how to change some global state at one point and change it back
>> at another, that seems like a bad idea and difficult to support.
>
> Sounds like a good plan. If we could find a way to merge the first two
> (__input_prepare__ and __array_prepare__), that'd be ideal.
I think it is better to keep them separate, so we don't have one method that is trying to do too much. It would be easier to explain in the documentation.
I may not have much time to look into this until after Monday. Is there a deadline we need to consider?
I don't think this should go into 2.0, I think it needs more thought.
Now that you mention it, I agree that it would be too rushed to try to get it in for 2.0. Concerning a later release, is there anything in particular that you think needs to be clarified or reconsidered?
And 2.0 already has significant code churn. Is there any reason beyond a big hassle not to set/restore the error state around all the ufunc calls in ma? Beyond that, the PEP that you pointed to looks interesting. Maybe some sort of decorator around ufunc calls could also be made to work.
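For reference, such a decorator is only a few lines. A minimal sketch (the name restore_errstate is made up here; numpy.seterr returns the previous settings so they can be restored in a finally block, and numpy.errstate does the same job as a context manager):

import functools
import numpy as np

def restore_errstate(func):
    # silence floating-point warnings for the duration of the call,
    # then restore whatever error state the caller had configured
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        old = np.seterr(all='ignore')
        try:
            return func(*args, **kwargs)
        finally:
            np.seterr(**old)
    return wrapper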
I think the PEP is interesting, but it is languishing. There were some questions and criticisms on the mailing list that I do not think were satisfactorily addressed, and as far as I know the author of the PEP has not pursued the matter further. There was some interest on the python-dev mailing list in the numpy community's use case, but I think we need to consider what can be done now to meet the needs of ndarray subclasses. I don't see PEP 3124 happening in the near future.
What I am proposing is a simple extension to our existing framework that lets subclasses hook into ufuncs and customize their behavior based on the context of the operation (using the __array_priority__ of the inputs and/or outputs, and the identity of the ufunc). The steps I listed allow customization at each critical point: preparing the input, preparing the output, populating the output (no customization proposed for this step), and finalizing the output. The only step that does not already have a hook is preparing the input.
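In rough Python, those steps might look like the sketch below. The hook names are the ones proposed above (__input_prepare__ does not exist yet), the priority lookup and output allocation are simplified stand-ins, and the inputs are assumed to implement all three hooks:

import numpy as np

def myufunc(x, y, out=None):
    # steps 1 and 2: the input with the highest __array_priority__
    # gets to rewrite the inputs on the way in
    obj = max((x, y), key=lambda a: getattr(a, '__array_priority__', 0))
    x, y = obj.__input_prepare__(myufunc, x, y)
    # step 3: create the output array and let obj prepare it; this is
    # the place to attach metadata or raise before any data is touched
    z = out if out is not None else np.empty(np.broadcast(x, y).shape)
    z = obj.__array_prepare__(z, (myufunc, x, y))
    # step 4: the actual elementwise calculation
    np.add(x, y, out=z)  # stand-in for the ufunc's inner loop
    # step 5: one last chance to wrap the result on the way out
    return obj.__array_wrap__(z, (myufunc, x, y))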
What bothers me here is the opposing desire to separate ufuncs from their ndarray dependency, having them operate on buffer objects instead. As I see it ufuncs would be split into layers, with a lower layer operating on buffer objects, and an upper layer tying them together with ndarrays where the "business" logic -- kinds, casting, etc -- resides. It is in that upper layer that what you are proposing would reside. Mind, I'm not sure that having matrices and masked arrays subclassing ndarray was the way to go, but given that they do one possible solution is to dump the whole mess onto the subtype with the highest priority. That subtype would then be responsible for casts and all the other stuff needed for the call and wrapping the result. There could be library routines to help with that. It seems to me that that would be the most general way to go. In that sense ndarrays themselves would just be another subtype with especially low priority.
I'm sorry, I didn't understand your point. What you described sounds identical to how things are currently done. What distinction are you making, aside from operating on the buffer object? How would adding a method to modify the input to a ufunc complicate the situation?
Just *one* function to rule them all and on the subtype dump it. No __array_wrap__, __input_prepare__, or __array_prepare__, just something like __handle_ufunc__. So it is similar but perhaps more radical. I'm proposing having the ufunc upper layer do nothing but decide which argument type will do all the rest of the work, casting, calling the low level ufunc base, providing buffers, wrapping, etc. Instead of pasting bits and pieces into the existing framework I would like to lay out a line of attack that ends up separating ufuncs into smaller pieces that provide low level routines that work on strided memory while leaving policy implementation to the subtype. There would need to be some default type (ndarray) when the functions are called on nested lists and scalars and I'm not sure of the best way to handle that.
I'm just sort of thinking out loud, don't take it too seriously.
Thanks for the clarification. I think I see how this could work: if ufuncs were callable instances of classes, __call__ would find the input with highest priority and pass itself and the input to that object's __handle_ufunc__. Now it is up to __handle_ufunc__ to determine whether and how to modify the input, call some method on the ufunc (like execute) to perform the buffer operation, then __handle_ufunc__ performs the cast, deals with metadata and returns the result.
I skipped a step: initializing the output buffer. Would that be rolled into the ufunc execution, or should it be possible for __handle_ufunc__ to access the initialized buffer before execution occurs (as __array_prepare__ does now)? I think it is important to be able to perform the cast and calculate metadata before ufunc execution: if an error occurs, an exception can be raised before the ufunc operates on the arrays, since the operation can modify data in place.
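To make that concrete, the upper layer could be as thin as the following sketch (__handle_ufunc__ and execute are just the names floated in this thread, not an existing API):

class UFunc:
    def __init__(self, execute):
        # execute is the low-level loop that operates on strided memory
        self.execute = execute

    def __call__(self, *args, **kwargs):
        # the only policy here: find the argument with the highest
        # priority and hand it everything, including the ufunc itself
        obj = max(args, key=lambda a: getattr(a, '__array_priority__', 0))
        # obj now owns casting, output allocation, calling self.execute
        # on the buffers, and wrapping the result; it can also raise
        # before execute ever touches the data
        return obj.__handle_ufunc__(self, *args, **kwargs)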
Darren

I'd like to use this thread to discuss possible improvements to generalize numpy's functions. Sorry for double posting, but we will have a hard time keeping track of the discussion about how to improve functions to deal with subclasses if it is spread across threads about warnings in masked arrays or masked arrays not dealing well with trapz. There is an additional bit at the end that was not discussed elsewhere.
On Thu, Mar 18, 2010 at 8:14 AM, Darren Dale dsdale24@gmail.com wrote:
On Wed, Mar 17, 2010 at 10:16 PM, Charles R Harris charlesr.harris@gmail.com wrote:
Just *one* function to rule them all and on the subtype dump it. No __array_wrap__, __input_prepare__, or __array_prepare__, just something like __handle_ufunc__. So it is similar but perhaps more radical. I'm proposing having the ufunc upper layer do nothing but decide which argument type will do all the rest of the work, casting, calling the low level ufunc base, providing buffers, wrapping, etc. Instead of pasting bits and pieces into the existing framework I would like to lay out a line of attack that ends up separating ufuncs into smaller pieces that provide low level routines that work on strided memory while leaving policy implementation to the subtype. There would need to be some default type (ndarray) when the functions are called on nested lists and scalars and I'm not sure of the best way to handle that.
I'm just sort of thinking out loud, don't take it too seriously.
Thanks for the clarification. I think I see how this could work: if ufuncs were callable instances of classes, __call__ would find the input with highest priority and pass itself and the input to that object's __handle_ufunc__. Now it is up to __handle_ufunc__ to determine whether and how to modify the input, call some method on the ufunc (like execute) to perform the buffer operation, then __handle_ufunc__ performs the cast, deals with metadata and returns the result.
I skipped a step: initializing the output buffer. Would that be rolled into the ufunc execution, or should it be possible for __handle_ufunc__ to access the initialized buffer before execution occurs (as __array_prepare__ does now)? I think it is important to be able to perform the cast and calculate metadata before ufunc execution: if an error occurs, an exception can be raised before the ufunc operates on the arrays, since the operation can modify data in place.
We discussed the possibility of simplifying the wrapping scheme with a method like __handle_gfunc__. (I don't think this necessarily has to be limited to ufuncs.) I think a second method like __prepare_input__ is also necessary. Imagine something like:
class GenericFunction:

    def __init__(self, executable):
        self._executable = executable

    @property
    def executable(self):
        return self._executable

    def __call__(self, *args, **kwargs):
        # find the input with the highest __array_priority__, and then:
        inp = max(args, key=lambda a: getattr(a, '__array_priority__', 0))
        args, kwargs = inp.__prepare_input__(self, *args, **kwargs)
        return inp.__handle_gfunc__(self, *args, **kwargs)
# this is the core function to be passed to the generic class:
def _add(a, b, out=None):
    # the generic, ndarray implementation.
    ...
# here is the publicly exposed interface:
add = GenericFunction(_add)
from numpy import ndarray

# now my subclasses
class MyArray(ndarray):

    # My class tweaks the execution of the function in __handle_gfunc__.
    # mod_input and mod_output are hypothetical per-gfunc lookup tables;
    # mod_input[gfunc] returns the modified (args, kwargs) pair.
    def __prepare_input__(self, gfunc, *args, **kwargs):
        return mod_input[gfunc](*args, **kwargs)

    def __handle_gfunc__(self, gfunc, *args, **kwargs):
        res = gfunc.executable(*args, **kwargs)
        # you could have called a different core func there
        return mod_output[gfunc](res, *args, **kwargs)
class MyNextArray(MyArray):

    def __prepare_input__(self, gfunc, *args, **kwargs):
        # let the superclass do its thing:
        args, kwargs = MyArray.__prepare_input__(self, gfunc, *args, **kwargs)
        # now I can tweak it further:
        return mod_input_further[gfunc](*args, **kwargs)

    def __handle_gfunc__(self, gfunc, *args, **kwargs):
        # let's defer to the superclass to handle calling the core function:
        res = MyArray.__handle_gfunc__(self, gfunc, *args, **kwargs)
        # and now we have one more crack at the result before passing it back:
        return mod_output_further[gfunc](res, *args, **kwargs)
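To see the dispatch in action, here is a toy subclass that just reports when its hooks fire. This is a sketch: LoggingArray and its priority value are invented, and _demo_add stands in for the elided _add above:

import numpy as np

class LoggingArray(np.ndarray):
    __array_priority__ = 15.0  # outrank plain ndarrays

    def __prepare_input__(self, gfunc, *args, **kwargs):
        print('preparing input for %s' % gfunc.executable.__name__)
        return args, kwargs

    def __handle_gfunc__(self, gfunc, *args, **kwargs):
        print('handling %s' % gfunc.executable.__name__)
        return gfunc.executable(*args, **kwargs)

def _demo_add(a, b, out=None):
    return np.add(a, b, out=out)

add = GenericFunction(_demo_add)

x = np.arange(3.0).view(LoggingArray)
y = np.ones(3)
print(add(x, y))  # LoggingArray wins the priority contest, so its
                  # hooks run even though y is a plain ndarray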
If a gfunc is not recognized, the subclass might raise a NotImplementedError, or it might just pass the original args and kwargs through. I didn't write that part out because the example was already running long. The point is that a single entry point could be used for any subclass, without the function having to worry about supporting every subclass. It may still be necessary to use asanyarray in the core functions, but if a subclass alters the behavior of some operation such that the operation needs to happen on an ndarray view of the data, __prepare_input__ provides an opportunity to prepare such views. For example, in our current situation, matrices would not be compatible with trapz if trapz did not cast the input to ndarrays, but as a result trapz is not compatible with masked arrays or quantities. With the proposed scheme, matrices would in some cases pass ndarray views to the core function, and in other cases pass the arguments through unmodified, since the function might build on other functions that are already generalized to support those types of data.
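As a sketch of that last point, a gfunc-wrapped trapz and a matrix subclass might interact like this (needs_flat_views is a hypothetical registry and MyMatrix a hypothetical subclass; GenericFunction is the one defined above):

import numpy as np

trapz = GenericFunction(np.trapz)
# hypothetical registry of gfuncs whose cores need plain ndarrays:
needs_flat_views = {trapz}

class MyMatrix(np.matrix):
    __array_priority__ = 20.0

    def __prepare_input__(self, gfunc, *args, **kwargs):
        if gfunc in needs_flat_views:
            # matrix semantics (always 2-D, * means dot) would confuse
            # the core, so hand it plain ndarray views instead
            args = tuple(np.asarray(a) for a in args)
        return args, kwargs

    def __handle_gfunc__(self, gfunc, *args, **kwargs):
        return gfunc.executable(*args, **kwargs)

m = MyMatrix([[1.0, 2.0, 3.0]])
print(trapz(m))  # the core sees an ndarray view, not a matrix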
Darren