[Numpy-discussion] Proposal to accept NEP-18, __array_function__ protocol

Stephan Hoyer shoyer at gmail.com
Wed Aug 15 12:45:10 EDT 2018


Nathaniel,

Thanks for raising these thoughtful concerns. Your independent review of
this proposal is greatly appreciated!

See my responses inline below:

On Mon, Aug 13, 2018 at 2:44 AM Nathaniel Smith <njs at pobox.com> wrote:

> The other approach would be to incrementally add clean, well-defined
> dunder methods like __array_ufunc__, __array_concatenate__, etc. This
> way we end up putting some thought into each interface, making sure
> that it's something we can support, protecting downstream libraries
> from unnecessary complexity (e.g. they can implement
> __array_concatenate__ instead of hstack, vstack, row_stack,
> column_stack, ...), or avoiding adding new APIs entirely (e.g., by
> converting existing functions into ufuncs so __array_ufunc__ starts
> automagically working). And in the end we get a clean list of dunder
> methods that new array container implementations have to define. It's
> plausible to imagine a generic test suite for array containers. (I
> suspect that every library that tries to implement __array_function__
> will end up with accidental behavioral differences, just because the
> numpy API is so vast and contains so many corner cases.) So the
> clean-well-defined-dunders approach has lots of upsides. The big
> downside is that this is a much longer road to go down.
>

RE: accidental differences in behavior:

I actually think that the __array_function__ approach is *less* prone to
accidental differences in behavior, because we require implementing every
function directly (or it raises an error).

This avoids a classic subclassing problem that has plagued NumPy for years,
where overriding the behavior of method A causes apparently unrelated
method B to break, because it relied on method A internally. In NumPy, this
constrained our implementation of np.median(), because it needed to call
np.mean() in order for subclasses implementing units to work properly.
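
To illustrate with a rough sketch (MyArray here is a hypothetical class):
under the proposed protocol, an unimplemented function becomes an explicit
error instead of silently producing wrong results:

    import numpy as np

    class MyArray:
        # only np.mean is overloaded; everything else is unimplemented
        def __array_function__(self, func, types, args, kwargs):
            if func is np.mean:
                return 42.0  # stand-in implementation
            return NotImplemented

    np.mean(MyArray())    # -> 42.0
    np.median(MyArray())  # -> TypeError, rather than a broken result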

There will certainly be accidental differences in behavior for third-party
code that *uses* NumPy, but this is basically inevitable for any proposal
that allows NumPy's public API to be overloaded. It's also avoided by
default by third-party libraries that follow the current best practice of
casting all input arrays with np.asarray().
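
For concreteness, that best practice is just a defensive coercion at the
top of each function -- a minimal sketch:

    import numpy as np

    def some_third_party_function(x):
        x = np.asarray(x)  # coerce any array-like to a plain ndarray
        ...  # the rest of the function only ever sees ndarrays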

--------------

RE: a hypothetical simplified interface:

The need to implement everything you want to use in NumPy's public API
could certainly be onerous, but on the other hand there is a long list of
projects that have already done this today -- and these are the projects
that most need __array_function__.

I'm sure there are cases where simplification would be warranted, but in
particular I don't think __array_concatenate__ has significant advantages
over simply implementing __array_function__ for np.concatenate. It's a
slightly different way of spelling it, but it basically does the same
thing. The level of complexity needed to implement hstack, vstack,
row_stack and column_stack in terms of np.concatenate is pretty minimal.
__array_function__ implementors could easily copy and paste code from
NumPy or use a third-party helper library (like NDArrayOperatorsMixin)
that provides such implementations.
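
For example, a sketch of hstack in terms of np.concatenate, loosely
following NumPy's own implementation:

    import numpy as np

    def hstack(arrays):
        arrays = [np.atleast_1d(a) for a in arrays]
        # 1-d inputs stack along axis 0; everything else along axis 1
        if arrays and arrays[0].ndim == 1:
            return np.concatenate(arrays, axis=0)
        return np.concatenate(arrays, axis=1)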

I also have other concerns about the "simplified API" approach beyond the
difficulty of figuring it out, but those are already mentioned in the NEP:
http://www.numpy.org/neps/nep-0018-array-function-protocol.html#implementations-in-terms-of-a-limited-core-api

> But... this is wishful thinking. No matter what the NEP says, I simply
> don't believe that we'll actually go break dask, sparse arrays,
> xarray, and sklearn in a numpy point release. Or any numpy release.
> Nor should we. If we're serious about keeping this experimental – and
> I think that's an excellent idea for now! – then IMO we need to do
> something more to avoid getting trapped by backwards compatibility.
>

I agree, but to be clear, development for dask, sparse and xarray (and even
broadly supported machine learning libraries like TensorFlow) still happens
at a much faster pace than is currently the case for "core" projects in the
SciPy stack like NumPy. It would not be a big deal to encounter breaking
changes in a "major" NumPy release (i.e., 1.X -> 1.(X+1)).

(Side note: sklearn doesn't directly implement any array types, so I don't
think it would make use of __array_function__ in any way, except possibly
to implement overloadable functions.)


> My suggestion: at numpy import time, check for an envvar, like say
> NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the
> __array_function__ dispatches turn into no-ops. This lets interested
> downstream libraries and users try this out, but makes sure that we
> won't have a hundred thousand end users depending on it without
> realizing.
>
> - makes it easy for end-users to check how much overhead this adds (by
> running their code with it enabled vs disabled)
> - if/when we decide to commit to supporting it for real, we just
> remove the envvar.
>

I'm slightly concerned that the cost of reading an environment variable
with os.environ could exaggerate the performance cost of
__array_function__. It takes about 1 microsecond to read an environment
variable on my laptop, which is comparable to the full overhead of
__array_function__. So we may want to switch to an explicit Python API
instead, e.g., np.enable_experimental_array_function().
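
Concretely, I'm imagining something like the following sketch (names are
placeholders): read any environment variable once at import time, and gate
dispatch on a module-level flag:

    import os

    # read the environment variable once, at import time
    _ENABLED = os.environ.get(
        'NUMPY_EXPERIMENTAL_ARRAY_FUNCTION', '0') == '1'

    def enable_experimental_array_function():
        # explicit Python-level opt-in, avoiding per-call os.environ reads
        global _ENABLED
        _ENABLED = True

    def _dispatch(default_impl, relevant_args, args, kwargs):
        if not _ENABLED:
            # no-op fast path: fall straight through to NumPy's own code
            return default_impl(*args, **kwargs)
        ...  # normal __array_function__ dispatch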

My bigger concern is when/how we decide to graduate __array_function__ from
requiring an explicit opt-in. We don't need to make a final decision now,
but it would be good to be clear about what specifically we are waiting for.

I see three types of likely scenarios for changing __array_function__:
1. We decide that overloading the NumPy namespace in general is a bad
idea, based on either performance or predictability consequences for
third-party libraries. In this case, I imagine we would probably keep
__array_function__, but revert to a separate namespace for explicitly
overloaded functions, e.g., numpy.api.
2. We want to keep __array_function__, but need a breaking change to the
interface (and we're really attached to keeping the name
__array_function__).
3. We decide that specific functions should use a different interface
(e.g., switch from __array_function__ to __array_ufunc__).

(1) and (2) are the sort of major concerns that in my mind would warrant
hiding a feature behind an experimental flag. For the most part, I expect
(1) could be resolved relatively quickly by running benchmark suites after
we have a working version of __array_function__. To be honest, I don't see
either of these rollback scenarios as terribly likely, but the downside
risk is large enough that we may want to protect ourselves for a major
release or two (6-12 months).

(3) will be a much longer process, likely to stretch out over years at the
current pace of NumPy development. I don't think we'll want to keep an
opt-in flag for this long of a period. Rather, we may want to accept a
shorter deprecation cycle than usual. In most cases, I suspect we could
incrementally switch to new overloads while preserving the
__array_function__ overload for a release or two.

> I don't really understand the 'types' frozenset. The NEP says "it will
> be used by most __array_function__ methods, which otherwise would need
> to extract this information themselves"... but they still need to
> extract the information themselves, because they still have to examine
> each object and figure out what type it is. And, simply creating a
> frozenset costs ~0.2 µs on my laptop, which is overhead that we can't
> possibly optimize later...
>

The most flexible alternative would be to just say that we provide a
fixed-length iterable, and return a tuple object. (In my microbenchmarks,
it's faster to make a tuple than a list or set.) In an early draft of the
NEP, I proposed exactly this, but the speed difference seemed really
marginal to me.
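
This kind of microbenchmark is easy to reproduce with timeit; absolute
numbers will of course vary by machine:

    import timeit

    # stand-ins for the argument types collected during dispatch
    types = [list, tuple, dict]
    print(timeit.timeit(lambda: tuple(types)))      # tuple construction
    print(timeit.timeit(lambda: frozenset(types)))  # frozenset construction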

I included 'types' in the interface because I really do think it's
something that almost all __array_function__ implementations should use.
It preserves a nice separation of concerns between dispatching logic and
implementations for a new type. At least as long as __array_function__
is experimental, I don't think we should be encouraging people to write
functions that return NotImplemented directly and rely entirely on the
NumPy interface.

Many but not all implementations will need to look at argument types. This
is only really essential for cases where mixed operations between NumPy
arrays and another type are allowed. If you only implement the NumPy
interface for MyArray objects, then in the usual Python style you wouldn't
need isinstance checks.

It's also important from an ecosystem perspective. If we don't make it easy
to get type information, my guess is that many __array_function__ authors
wouldn't bother to return NotImplemented for unexpected types, which means
that __array_function__ will break in weird ways when used with objects
from unrelated libraries.
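
With 'types' in hand, that check is a one-liner. A minimal sketch, again
with a hypothetical MyArray class:

    import numpy as np

    class MyArray:
        def __array_function__(self, func, types, args, kwargs):
            # bail out early on any type we don't know how to handle
            if not all(issubclass(t, (MyArray, np.ndarray)) for t in types):
                return NotImplemented
            ...  # otherwise, dispatch func to a MyArray implementation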

Cheers,
Stephan