[Numpy-discussion] NEP 31 — Context-local and global overrides of the NumPy API

Mon Sep 9 21:27:34 EDT 2019

On Sun, Sep 8, 2019 at 6:27 PM Nathaniel Smith <njs at pobox.com> wrote:

> On Sun, Sep 8, 2019 at 8:40 AM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> >
> >
> >
> > On Sun, Sep 8, 2019 at 12:54 AM Nathaniel Smith <njs at pobox.com> wrote:
> >>
> >> On Fri, Sep 6, 2019 at 11:53 AM Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> >> > On Fri, Sep 6, 2019 at 12:53 AM Nathaniel Smith <njs at pobox.com>
> wrote:
> >> >> On Tue, Sep 3, 2019 at 2:04 AM Hameer Abbasi <
> einstein.edison at gmail.com> wrote:
> >> >> > The fact that we're having to design more and more protocols for a
> lot
> >> >> > of very similar things is, to me, an indicator that we do have
> holistic
> >> >> > problems that ought to be solved by a single protocol.
> >> >>
> >> >> But the reason we've had trouble designing these protocols is that
> >> >> they're each different :-). If it was just a matter of copying
> >> >> __array_ufunc__ we'd have been done in a few minutes...
> >> >
> >> > I don't think that argument is correct. That we now have two very
> similar protocols is simply a matter of history and limited developer time.
> NEP 18 discusses in several places that __array_ufunc__ should be brought
> in line with __array_ufunc__, and that we can migrate a function from one
> protocol to the other. There's no technical reason other than backwards
> compat and dev time why we couldn't use __array_function__ for ufuncs also.
> >>
> >> Huh, that's interesting! Apparently we have a profoundly different
> >> understanding of what we're doing here.
> >
> >
> > That is interesting indeed. We should figure this out first - no point
> discussing a NEP about plugging the gaps in our override system when we
> don't have a common understanding of why we wanted/needed an override
> system in the first place.
> >
> >> To me, __array_ufunc__ and
> >> __array_function__ are completely different. In fact I'd say
> >> __array_ufunc__ is a good idea and __array_function__ is a bad idea,
> >
> >
> > It's early days, but "customer feedback" certainly has been more
> enthusiastic for __array_function__. Also from what I've seen so far it
> works well. Example: at the SciPy sprints someone put together Xarray plus
> pydata/sparse to use distributed sparse arrays for visualizing some large
> genetic (I think) data sets. That was made to work in a single day, with
> impressively little code.
>
> Yeah, it's true, and __array_function__ made a bunch of stuff that
> used to be impossible become possible, I'm not saying it didn't. My
> prediction is that the longer we live with it, the more limits we'll
> hit and the more problems we'll have with long-term maintainability. I
> don't think initial enthusiasm is a good predictor of that either way.
>
> >> The key difference is that __array_ufunc__ allows for *generic*
> >> implementations.
> >
> > Implementations of what?
>
> Generic in the sense that you can write __array_ufunc__ once and have
> it work for all ufuncs.
>
> >> Most duck array libraries can write a single
> >> implementation of __array_ufunc__ that works for *all* ufuncs, even
> >> new third-party ufuncs that the duck array library has never heard of,
> >
> >
> > I see where you're going with this. You are thinking of reusing the
> ufunc implementation to do a computation. That's a minor use case (imho),
> and I can't remember seeing it used.
>
> I mean, I just looked at dask and xarray, and they're both doing
> exactly what I said, right now in shipping code. What use cases are
> you targeting here if you consider dask and xarray out-of-scope? :-)
>

I don't think that's the interesting part, or even right. When you call
`np.cos(dask_array_of_cupy_arrays)`, it certainly will not reuse the NumPy
ufunc np.cos. It will call da.cos, and that will in turn call cupy.cos. Yes
it will call np.cos if you feed it a dask array that contains a NumPy
ndarray under the hood. But that's equally true of np.mean, which is not a
ufunc. The story here is ~95% parallel for __array_ufunc__ and
__array_function__. When I said not seeing used, I meant in ways that are
fundamentally different between those two protocols.

> > this is case where knowing if something is a ufunc helps use a property
> of it. so there the more specialized nature of __array_ufunc__ helps. Seems
> niche though, and could probably also be done by checking if a function is
> an instance of np.ufunc via __array_function__
>
> Sparse arrays aren't very niche... and the isinstance trick is
> possible in some cases, but (a) it's relying on an undocumented
> implementation detail of __array_function__; according to
> __array_function__'s API contract, you could just as easily get passed
> the ufunc's __call__ method instead of the object itself,

That seems to be a matter of making it documented? Currently the dispatcher
is only attached to functions, not methods.

and (b) it
> doesn't work at all for ufunc methods like reduce, outer, accumulate.
>

No idea without looking in more detail if this can be made to work, but a
quick count in the SciPy code base says ~10 uses of .reduce, 2 of .outer
and 0 of .accumulate. So hardly showstoppers I'd say.

These are both show-stoppers IMO.
>
> > This last point, using third-party ufuncs, is the interesting one to me.
> They have to be generated with the NumPy ufunc machinery, so the dispatch
> mechanism is attached to them. We could do third party functions for
> __array_function__ too, but that would require making
> @array_function_dispatch public, which we haven't done (yet?).
>
> With __array_function__ it's theoretically possible to do the dispatch
> on third-party functions, but when someone defines a new function they
> always have to go update all the duck array libraries to hard-code in
> some special knowledge of their new function. So in my example, even
> if we made @array_function_dispatch public, you still couldn't use
> your nice new numba-created gufunc unless you first convinced dask,
> xarray, and bcolz to all accept patches to support your new gufunc.
> With __array_ufunc__, it works out-of-the-box.
>

Yep that's true. May still be better than not doing anything though, in
some cases. You'll get a TypeError with a clear message for functions that
aren't implemented, for something that otherwise likely doesn't work either.

> > But what is that road, and what do you think the goal is? To me it's:
> separate our API from our implementation. Yours seems to be "reuse our
> implementations" for __array_ufunc__, but I can't see how that generalizes
> beyond ufuncs.
>
> The road is to define *abstractions* for the operations we expose
> through our API, so that duck array implementors can work against a
> contract with well-defined preconditions and postconditions, so they
> can write code the works reliably even when the surrounding
> environment changes. That's the only way to keep things maintainable
> AFAICT. If the API contract is just a vague handwave at the numpy API,
> then no-one knows which details actually matter, it's impossible to
> test, implementations will inevitably end up with subtle long-standing
> bugs, and literally any change in numpy could potentially break duck
> array users, we don't know. So my motivation is that I like testing, I
> don't like bugs, and I like being able to maintain things with
> confidence :-). The principles are much more general than ufuncs;
> that's just a pertinent example.
>

Well, it's hard to argue with that in the abstract. I like all those things
too:)

The question is, what does that mean concretely? Most of the NumPy API,
(g)ufuncs excepted, doesn't have well-defined abstractions, and it's hard
to imagine we'll get those even if we could be more liberal with backwards
compat. Most functions are just, well, functions. You can dispatch on them,
or not. Your preference seems to be the latter, but I have a hard time
figuring out how that translates into anything but "do nothing". Do you
have a concrete alternative?

I think we've chosen to try the former - dispatch on functions so we can
reuse the NumPy API. It could work out well, it could give some long-term
maintenance issues, time will tell. The question is now if and how to plug
the gap that __array_function__ left. It's main limitation is "doesn't work
for functions that don't have an array-like input" - that left out ~10-20%
of functions. So now we have a proposal for a structural solution to that
last 10-20%. It seems logical to want that gap plugged, rather than go back
and say "we shouldn't have gone for the first 80%, so let's go no further".

> > I think this is an important point. GPUs are massively popular, and when
> very likely just continue to grow in importance. So anything we do in this
> space that says "well it works, just not for GPUs" is probably not going to
> solve our most pressing problems.
>
> I'm not saying "__array_ufunc__ doesn't work for GPUs". I'm saying
> that when it comes to GPUs, there's an upper bound for how good you
> can hope to do, and __array_ufunc__ achieves that upper bound. So does
> __array_function__. So if we only care about GPUs, they're about
> equally good.

Indeed.

But if we also care about dask and xarray and compressed
> storage and sparse storage and ... then __array_ufunc__ is strictly
> superior in those cases.

That it's superior not really interesting though is it? Their main
characteristic (the actual override) is identical, and then ufuncs go a bit
further. I think to convince me you're going to have to come up with an
actual alternative plan to `__array_ufunc__ + __array_function__ +
unumpy-or-alternative-to-it`.

And re maintenance worries: I think cleaning up our API surface and
namespaces will go *much* further than yes/no on overrides.

> So replacing __array_ufunc__ with
> __array_function__ would be a major backwards step.
>

To be 100% clear, no one is actually proposing this.

Cheers,
Ralf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20190909/1fb4f3bc/attachment-0001.html>