[Numpy-discussion] NEP 31 — Context-local and global overrides of the NumPy API

Nathaniel Smith njs at pobox.com
Sun Sep 8 21:26:43 EDT 2019

On Sun, Sep 8, 2019 at 8:40 AM Ralf Gommers <ralf.gommers at gmail.com> wrote:
> On Sun, Sep 8, 2019 at 12:54 AM Nathaniel Smith <njs at pobox.com> wrote:
>> On Fri, Sep 6, 2019 at 11:53 AM Ralf Gommers <ralf.gommers at gmail.com> wrote:
>> > On Fri, Sep 6, 2019 at 12:53 AM Nathaniel Smith <njs at pobox.com> wrote:
>> >> On Tue, Sep 3, 2019 at 2:04 AM Hameer Abbasi <einstein.edison at gmail.com> wrote:
>> >> > The fact that we're having to design more and more protocols for a lot
>> >> > of very similar things is, to me, an indicator that we do have holistic
>> >> > problems that ought to be solved by a single protocol.
>> >>
>> >> But the reason we've had trouble designing these protocols is that
>> >> they're each different :-). If it was just a matter of copying
>> >> __array_ufunc__ we'd have been done in a few minutes...
>> >
>> > I don't think that argument is correct. That we now have two very similar protocols is simply a matter of history and limited developer time. NEP 18 discusses in several places that __array_ufunc__ should be brought in line with __array_function__, and that we can migrate a function from one protocol to the other. There's no technical reason other than backwards compat and dev time why we couldn't use __array_function__ for ufuncs also.
>> Huh, that's interesting! Apparently we have a profoundly different
>> understanding of what we're doing here.
> That is interesting indeed. We should figure this out first - no point discussing a NEP about plugging the gaps in our override system when we don't have a common understanding of why we wanted/needed an override system in the first place.
>> To me, __array_ufunc__ and
>> __array_function__ are completely different. In fact I'd say
>> __array_ufunc__ is a good idea and __array_function__ is a bad idea,
> It's early days, but "customer feedback" certainly has been more enthusiastic for __array_function__. Also from what I've seen so far it works well. Example: at the SciPy sprints someone put together Xarray plus pydata/sparse to use distributed sparse arrays for visualizing some large genetic (I think) data sets. That was made to work in a single day, with impressively little code.

Yeah, it's true, and __array_function__ made a bunch of stuff that
used to be impossible become possible, I'm not saying it didn't. My
prediction is that the longer we live with it, the more limits we'll
hit and the more problems we'll have with long-term maintainability. I
don't think initial enthusiasm is a good predictor of that either way.

>> The key difference is that __array_ufunc__ allows for *generic*
>> implementations.
> Implementations of what?

Generic in the sense that you can write __array_ufunc__ once and have
it work for all ufuncs.
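
A minimal sketch of what "generic" means here, not taken from any real
library. The class name `Logged` and its `calls` attribute are hypothetical,
and the sketch ignores details such as duck arrays passed via `out=`. One
handler unwraps the inputs and defers to whatever ufunc and method it was
given:

```python
import numpy as np


class Logged:
    """Hypothetical duck array: wraps an ndarray and records which ufuncs ran."""

    def __init__(self, data):
        self.data = np.asarray(data)
        self.calls = []

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any Logged inputs, then reuse the ufunc's own implementation.
        # Nothing here names a specific ufunc, so the handler is generic.
        raw = [x.data if isinstance(x, Logged) else x for x in inputs]
        result = getattr(ufunc, method)(*raw, **kwargs)
        out = Logged(result)
        out.calls = self.calls + [(ufunc.__name__, method)]
        return out


a = Logged([1.0, 4.0, 9.0])
b = np.sqrt(a)        # handler sees method="__call__"
c = np.add(b, 1)      # same handler, different ufunc
d = np.add.reduce(c)  # same handler, method="reduce"
```

The same few lines cover `np.sqrt`, `np.add` and `np.add.reduce` without the
class listing any of them by name.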

>> Most duck array libraries can write a single
>> implementation of __array_ufunc__ that works for *all* ufuncs, even
>> new third-party ufuncs that the duck array library has never heard of,
> I see where you're going with this. You are thinking of reusing the ufunc implementation to do a computation. That's a minor use case (imho), and I can't remember seeing it used.

I mean, I just looked at dask and xarray, and they're both doing
exactly what I said, right now in shipping code. What use cases are
you targeting here if you consider dask and xarray out-of-scope? :-)

> this is a case where knowing if something is a ufunc helps use a property of it. So there the more specialized nature of __array_ufunc__ helps. Seems niche though, and could probably also be done by checking if a function is an instance of np.ufunc via __array_function__

Sparse arrays aren't very niche... and the isinstance trick is
possible in some cases, but (a) it's relying on an undocumented
implementation detail of __array_function__; according to
__array_function__'s API contract, you could just as easily get passed
the ufunc's __call__ method instead of the object itself, and (b) it
doesn't work at all for ufunc methods like reduce, outer, accumulate.
These are both show-stoppers IMO.
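
A small sketch of point (b). The `Probe` class is hypothetical and only echoes
back what it receives. The `isinstance` checks illustrate what a handler would
see if it were handed a ufunc method as a plain callable; that is an
assumption made for illustration, since ufuncs are not actually routed through
__array_function__ today:

```python
import numpy as np

# The ufunc object itself passes the isinstance check...
print(isinstance(np.add, np.ufunc))         # True
# ...but its methods are ordinary bound methods, so the check fails for them.
print(isinstance(np.add.reduce, np.ufunc))  # False


class Probe:
    """Hypothetical object that just reports what __array_ufunc__ was given."""

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        return (ufunc, method)


p = Probe()
call_info = np.add(p, 1)
reduce_info = np.add.reduce(p)
accumulate_info = np.add.accumulate(p)
outer_info = np.add.outer(p, p)
```

With __array_ufunc__ the handler always receives the ufunc object plus a
method name string, so `reduce`, `accumulate` and `outer` stay distinguishable.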

> This last point, using third-party ufuncs, is the interesting one to me. They have to be generated with the NumPy ufunc machinery, so the dispatch mechanism is attached to them. We could do third party functions for __array_function__ too, but that would require making @array_function_dispatch public, which we haven't done (yet?).

With __array_function__ it's theoretically possible to do the dispatch
on third-party functions, but when someone defines a new function they
always have to go update all the duck array libraries to hard-code in
some special knowledge of their new function. So in my example, even
if we made @array_function_dispatch public, you still couldn't use
your nice new numba-created gufunc unless you first convinced dask,
xarray, and bcolz to all accept patches to support your new gufunc.
With __array_ufunc__, it works out-of-the-box.
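
A rough sketch of the contrast, with several stand-ins. `Wrapped` and
`TableArray` are hypothetical duck arrays, `clip_add` is a ufunc built with
`np.frompyfunc` standing in for a numba-created gufunc, and `np.mean` stands in
for "a function the duck array library has not been patched for yet":

```python
import numpy as np


class Wrapped:
    """Hypothetical duck array with a generic __array_ufunc__."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        raw = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(getattr(ufunc, method)(*raw, **kwargs))


# A "third-party" ufunc that Wrapped has never heard of.
clip_add = np.frompyfunc(lambda x, y: min(x + y, 10), 2, 1)
r = clip_add(Wrapped([3, 8]), 5)  # handled by the generic code above


class TableArray:
    """Hypothetical duck array using a per-function table for __array_function__."""

    HANDLED = {}

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Each supported function needs its own hard-coded entry.
        if func not in self.HANDLED:
            return NotImplemented
        return self.HANDLED[func](*args, **kwargs)


TableArray.HANDLED[np.sum] = lambda a, **kw: np.sum(a.data, **kw)

total = np.sum(TableArray([1, 2, 3]))  # works: np.sum has a table entry
try:
    np.mean(TableArray([1, 2, 3]))     # no entry, so NumPy raises TypeError
    mean_supported = True
except TypeError:
    mean_supported = False
```

The ufunc path needed no change to `Wrapped` for `clip_add`, while the table
path needs a new entry for every function it is meant to support.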

> But what is that road, and what do you think the goal is? To me it's: separate our API from our implementation. Yours seems to be "reuse our implementations" for __array_ufunc__, but I can't see how that generalizes beyond ufuncs.

The road is to define *abstractions* for the operations we expose
through our API, so that duck array implementors can work against a
contract with well-defined preconditions and postconditions, so they
can write code that works reliably even when the surrounding
environment changes. That's the only way to keep things maintainable
AFAICT. If the API contract is just a vague handwave at the numpy API,
then no-one knows which details actually matter, it's impossible to
test, implementations will inevitably end up with subtle long-standing
bugs, and literally any change in numpy could potentially break duck
array users, we don't know. So my motivation is that I like testing, I
don't like bugs, and I like being able to maintain things with
confidence :-). The principles are much more general than ufuncs;
that's just a pertinent example.

> I think this is an important point. GPUs are massively popular, and will very likely just continue to grow in importance. So anything we do in this space that says "well it works, just not for GPUs" is probably not going to solve our most pressing problems.

I'm not saying "__array_ufunc__ doesn't work for GPUs". I'm saying
that when it comes to GPUs, there's an upper bound on how well you
can hope to do, and __array_ufunc__ achieves that upper bound. So does
__array_function__. So if we only care about GPUs, they're about
equally good. But if we also care about dask and xarray and compressed
storage and sparse storage and ... then __array_ufunc__ is strictly
superior in those cases. So replacing __array_ufunc__ with
__array_function__ would be a major backwards step.


Nathaniel J. Smith -- https://vorpus.org
