Re: [Numpy-discussion] Proposal to accept NEP-18, __array_function__ protocol

22 Aug 2018

Hi Nathaniel and Stephan,

Since this conversation is getting a bit lengthy and I see a lot of repeated stuff, I’ll summarise the arguments for everyone’s benefit and then present my own viewpoints:

Nathaniel:
1. Undue maintenance burden on NumPy, since semantics have to match exactly.
2. Implementations of functions may change, which may break downstream library compatibility.
3. There may be time taken in merging this everywhere, so why not take time to define proper protocols?
4. Hide this entire interface behind an environment variable, possibly to be removed later.

Stephan:
1. Semantics don’t have to match exactly; that isn’t the intent of most duck arrays.
2. This won’t happen given NumPy’s conservativeness.
3. The protocols will just be copies of __array_function__, but less capable.
4. Provide an interface that only end-users may turn on.

My viewpoints:
1. I don’t think any duck-array implementers intend to copy semantics on that level. Dask, which is the most complete one, doesn’t have views, only copies. Many other semantics simply don’t match. The intent is to allow for code that expresses, well, intent (no pun intended) instead of relying heavily on semantics, but that can use arbitrary duck-array implementations instead of just ndarray.
2. Most of the implementations in NumPy are pretty stable, and the only thing that’s likely to happen here is bug fixes. And we are free to fix those bugs; I doubt implementation-specific bugs will be copied. However, these first two points are for/against duck arrays in general, and not specific to this protocol, so IMO that discussion is completely orthogonal to this one.
3. I agree with Stephan here: defining a minimum API for NumPy that will support complete duck arrays will, in every case, leave a lot of functions that cannot be overridden, as they simply cannot be expressed in terms of the protocols we have added so far. This will lead to more protocols being produced, and so on ad infinitum. We have to consider the burden that such a design would place on the maintainers of NumPy as well… I personally feel that the number of such protocols we’ll need is large enough that this line of action is more burdensome, rather than less. I prefer an approach with __array_function__ + mailing list ping before adding a function.
4. May I propose an alternative that was already discussed, and one that I think everyone will be okay with: we put all overridable functions inside a new submodule, numpy.api, that will initially be a shallow-ish copy of the numpy module. I say ish because all modules inside NumPy will need to be shallow-copied as well. If we need to add __array_function__, we can always do that there. Normal users are using “regular” NumPy unless they know they’re using the API, but it is separately accessible. As far as hiding it completely goes: we have to realise that the Python computation landscape is fragmenting. The slower we are, the more fragmented it will become. NumPy already isn’t “the standard” for machine learning.
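
A minimal sketch of the opt-in namespace idea, as a toy model in plain Python rather than NumPy. The names `plain_sum`, `api_sum` and `LoggedArray` are hypothetical and only illustrate the split between a "regular" function that never dispatches and an overridable counterpart that consults __array_function__; the hook signature follows NEP 18.

```python
# Toy model only: plain Python, no NumPy. `plain_sum`, `api_sum` and
# `LoggedArray` are hypothetical names used for illustration.

def plain_sum(values):
    """Stand-in for a "regular" function: never dispatches."""
    return sum(list(values))


def api_sum(values):
    """Stand-in for the same function in an opt-in, overridable namespace."""
    override = getattr(type(values), "__array_function__", None)
    if override is not None:
        # Hand the public function and its arguments to the duck array.
        return override(values, api_sum, (type(values),), (values,), {})
    return plain_sum(values)


class LoggedArray:
    """A duck array that records which overridable functions were called."""

    def __init__(self, data):
        self.data = list(data)
        self.calls = []

    def __iter__(self):
        return iter(self.data)

    def __array_function__(self, func, types, args, kwargs):
        self.calls.append(func.__name__)
        return plain_sum(self.data)


arr = LoggedArray([1, 2, 3])
regular_result = plain_sum(arr)  # regular namespace: the override is ignored
api_result = api_sum(arr)        # opt-in namespace: the override is consulted
```

In this sketch only the call through `api_sum` reaches the duck array's hook, so code that never opts in keeps the existing behaviour.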

Regards,
Hameer Abbasi
...
On 22. Aug 2018, at 03:46, Nathaniel Smith wrote:
On Tue, Aug 21, 2018 at 9:39 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
...
On Tue, Aug 21, 2018 at 12:21 AM Nathaniel Smith wrote:
...
On Wed, Aug 15, 2018 at 9:45 AM, Stephan Hoyer wrote:
...
This avoids a classic subclassing problem that has plagued NumPy for years,
where overriding the behavior of method A causes apparently unrelated method
B to break, because it relied on method A internally. In NumPy, this
constrained our implementation of np.median(), because it needed to call
np.mean() in order for subclasses implementing units to work properly.
I don't think I follow... if B uses A internally, then overriding A
shouldn't cause B to break, unless the overridden A is buggy.
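
A toy illustration of both sides of this exchange, in plain Python rather than NumPy. The classes `Samples` and `Metres` and the two median variants are hypothetical. A median that routes through mean() picks up a correct override, while a refactored median that does its own averaging silently bypasses it, which is why the internal call becomes something subclasses depend on.

```python
# Toy model only: plain Python, no NumPy. All names here are hypothetical.

class Samples:
    def __init__(self, values):
        self.values = sorted(values)

    def mean(self):
        return sum(self.values) / len(self.values)

    def _middle(self):
        # One value for odd lengths, the two central values for even lengths.
        n = len(self.values)
        return self.values[(n - 1) // 2 : n // 2 + 1]

    def median_via_mean(self):
        """Median whose final averaging goes through the overridable mean()."""
        return type(self)(self._middle()).mean()

    def median_direct(self):
        """A refactored median that averages by itself, bypassing mean()."""
        middle = self._middle()
        return sum(middle) / len(middle)


class Metres(Samples):
    """Subclass that attaches a unit by overriding mean() only."""

    def mean(self):
        return (super().mean(), "m")


lengths = Metres([4, 1, 3, 2])
via_mean = lengths.median_via_mean()  # unit preserved through the override
direct = lengths.median_direct()      # unit lost: mean() was never called
```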
Let me try another example with arrays with units. My understanding of the
contract provided by unit implementations is their behavior should never
deviate from NumPy unless an operation raises an error. (This is more
explicit for arrays with units because they raise errors for operations with
incompatible units, but practically speaking almost all duck arrays will
have at least some unsupported operations in NumPy's giant API.)
It is quite possible that NumPy functions could be (re)written in a way that
is incompatible with some unit implementations but is perfectly valid for
"full" duck arrays. We actually see this even within NumPy already -- for
example, see this recent PR adding support for the datetime64 dtype to
percentile:
https://github.com/numpy/numpy/pull/11627
I clicked the link, but I don't see anything about units?
Of course units are a tricky example to make inferences from, because
they aren't a good fit for the duck array concept in general. (In
terms of numpy's core semantics, data-with-units is a special dtype,
not a special container type.)
From your mention of "full" duck arrays I guess you're thinking of
this distinction?:
http://www.numpy.org/neps/nep-0022-ndarray-duck-typing-overview.html#princip...
You're right: if numpy changes the implementation of some high-level
function to use protocol-A instead of protocol-B, and there's some
partial-duck-array that only implements protocol-B, then it gets
broken. Of course, in general __array_function__ has the same problem:
if sklearn changes their implementation of some function to call numpy
function A instead of numpy function B, and there's a
partial-duck-array that only implements numpy function B, then sklearn
is broken. I think that as duck arrays roll out, we're just going to
have to get used to dealing with breakage like this sometimes.
The advantage of __array_function__ is that we get to ignore these
issues within numpy itself. The advantage of having focused-protocols
is that they make it easier to implement full duck arrays, and they
give us a vocabulary for talking about degrees of partiality. For
example, with __array_concatenate__, a duck array either supports all
the concatenation/stacking operations or none of them – so sklearn
never has to worry that switching between np.row_stack and np.stack
will cause issues.
...
A lesser case of this are changes in NumPy causing performance issues for
users of duck arrays, which is basically inevitable if we share
implementations.
NumPy (and Python in general) is never going to make everything 100%
optimized all the time. Over and over we choose to accept small
inefficiencies in order to improve maintainability. How big are these
inefficiencies – 1% overhead, 10% overhead, 10x overhead...? Do they
show up everywhere, or just for a few key functions? What's the
maintenance cost of making NumPy's whole API overrideable, in terms of
making it harder for us to evolve numpy? What about for users dealing
with a proliferation of subtly incompatible implementations?
You may be right that the tradeoffs work out so that every API needs
to be individually overridable and the benefits are worth it, but we
at least need to be asking these questions.
...
...
And when we fix a bug in row_stack, this means we also have to fix it
in all the copy-paste versions, which won't happen, so np.row_stack
has different semantics on different objects, even if they started out
matching. The NDArrayOperatorsMixin reduces the number of duplicate
copies of the same code that need to be updated, but 2 copies is still
a lot worse than 1 copy :-).
I see your point, but in all seriousness if we encounter a bug in np.row_stack
at this point we might just call it a feature instead.
Yeah, you're right, row_stack is a bad example :-). But of course the
point is that it's literally any bug-fix or added feature in numpy's
public API.
Here's a better, more concrete example: back in 2015, you added
np.stack (PR #5605), which was a great new feature. Its implementation
was entirely in terms of np.concatenate and other basic APIs like
.ndim, asanyarray, etc.
In the smallish-set-of-designed-protocols world, as soon as that's
merged into numpy, you're done: it works on sparse arrays, dask
arrays, tensorflow tensors, etc. People can use it as soon as they
upgrade their numpy.
In the __array_function__ world, merging into numpy is only the
beginning: now you have to go make new PRs to sparse, dask,
tensorflow, etc., get them merged, released, etc. Downstream projects
may refuse to use it until it's supported in multiple projects that
have their own release cycles, etc.
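
The contrast between the two worlds can be sketched with a toy model in plain Python (no NumPy). Every name below is hypothetical: `ProtocolDuck` implements the focused __array_concatenate__ hook mentioned earlier in the thread, while `FunctionDuck` stands for a library whose __array_function__ was written before the new stack function existed.

```python
# Toy model only: plain Python, no NumPy. All names here are hypothetical.

class ProtocolDuck:
    """Duck array implementing one focused protocol, __array_concatenate__."""

    def __init__(self, items):
        self.items = list(items)

    def add_leading_axis(self):
        # Stand-in for a basic operation such as arr[np.newaxis].
        return ProtocolDuck([self.items])

    def __array_concatenate__(self, arrays):
        return ProtocolDuck([item for array in arrays for item in array.items])


def concatenate(arrays):
    """Protocol world: defer to the first array's hook."""
    return arrays[0].__array_concatenate__(arrays)


def stack(arrays):
    """Added later, written only in terms of concatenate()."""
    return concatenate([array.add_leading_axis() for array in arrays])


class FunctionDuck:
    """Duck array with per-function overrides, released before stack existed."""

    def __init__(self, items):
        self.items = list(items)

    def __array_function__(self, func, types, args, kwargs):
        if func is dispatched_concatenate:
            (arrays,) = args
            return FunctionDuck([item for array in arrays for item in array.items])
        return NotImplemented  # unknown function: nothing to fall back on


def dispatch(func, arrays):
    result = arrays[0].__array_function__(func, (type(arrays[0]),), (arrays,), {})
    if result is NotImplemented:
        raise TypeError(f"no implementation found for {func.__name__}")
    return result


def dispatched_concatenate(arrays):
    return dispatch(dispatched_concatenate, arrays)


def dispatched_stack(arrays):
    return dispatch(dispatched_stack, arrays)


# Protocol world: the new function works without touching ProtocolDuck.
stacked = stack([ProtocolDuck([1, 2]), ProtocolDuck([3, 4])])

# Override world: the new function fails until FunctionDuck adds support.
try:
    dispatched_stack([FunctionDuck([1, 2]), FunctionDuck([3, 4])])
    stack_supported = True
except TypeError:
    stack_supported = False
```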
Or another example: at a workshop a few years ago, Matti put up some
of the source code to numpypy to demonstrate what it looked like. I
immediately spotted a subtle bug, because I happened to know that it
was one we'd found and fixed recently. (IIRC it was the thing where
arr[...] should return a view of arr, not arr itself.) Of course
indexing for duck arrays is its own mess that's somewhat orthogonal to
__array_function__, but the basic point is that numpy has a lot of
complex error-prone semantics, and we are still actively finding and
fixing issues in numpy's own implementations.
...
...
...
1. The details of how NumPy implements a high-level function in terms of
overloaded functions now becomes an implicit part of NumPy’s public API. For
example, refactoring stack to use np.block() instead of np.concatenate()
internally would now become a breaking change.
The way I'm imagining this would work is, we guarantee not to take a
function that used to be implemented in terms of overridable
operations, and refactor it so it's implemented in terms of
overridable operations. So long as people have correct implementations
of __array_concatenate__ and __array_block__, they shouldn't care
which one we use. In the interim period where we have
__array_concatenate__ but there's no such thing as __array_block__,
then that refactoring would indeed break things, so we shouldn't do
that :-). But we could fix that by adding __array_block__.
"we guarantee not to take a function that used to be implemented in terms
of overridable operations, and refactor it so it's implemented in terms of
overridable operations"
Did you miss a "not" in here somewhere, e.g., "refactor it so it's NOT
implemented"?
Yeah, sorry.
...
If we ever tried to do something like this, I'm pretty sure that it just
wouldn't happen -- unless we also change NumPy's extremely conservative
approach to breaking third-party code. np.block() is much more complex to
implement than np.concatenate(), and users would resist being forced to
handle that complexity if they don't need it. (Example: TensorFlow has a
concatenate function, but not block.)
I agree, we probably wouldn't do this particular change.
...
...
...
2. Array libraries may prefer to implement high level functions
differently than NumPy. For example, a library might prefer to implement a
fundamental operation like mean() directly rather than relying on sum()
followed by division. More generally, it’s not clear yet what exactly
qualifies as core functionality, and figuring this out could be a large
project.
True. And this is a very general problem... for example, the
appropriate way to implement logistic regression is very different
in-core versus out-of-core. You're never going to be able to take code
written for ndarray, drop in an arbitrary new array object, and get
optimal results in all cases -- that's just way too ambitious to hope
for. There will be cases where reducing to operations like sum() and
division is fine. There will be cases where you have a high-level
operation like logistic regression, where reducing to sum() and
division doesn't work, but reducing to slightly-higher-level
operations like np.mean also doesn't work, because you need to redo
the whole high-level operation. And then there will be cases where
sum() and division are too low-level, but mean() is high-level enough
to make the critical difference. It's that last one where it's
important to be able to override mean() directly. Are there a lot of
cases like this?
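
One hypothetical instance of that last case, as a toy model in plain Python (no NumPy). The names `generic_mean`, `mean`, `BasicDuck` and `StreamingDuck` are illustrative assumptions. The generic route reduces mean to sum() followed by division, which overflows for very large floats, while an array type that overrides mean() directly can use a running update that stays finite.

```python
# Toy model only: plain Python, no NumPy. All names here are hypothetical.

def generic_mean(array):
    """Library-side mean reduced to lower-level pieces: sum() then division."""
    return array.sum() / len(array)


def mean(array):
    """Overridable mean: prefer the array's own implementation if it has one."""
    own_mean = getattr(array, "mean", None)
    return own_mean() if own_mean is not None else generic_mean(array)


class BasicDuck:
    """Implements only the low-level pieces: len() and sum()."""

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def sum(self):
        return sum(self.values)


class StreamingDuck(BasicDuck):
    """Also overrides mean() with a running update."""

    def mean(self):
        # Running mean: never materialises the full sum.
        running = 0.0
        for count, value in enumerate(self.values, start=1):
            running += (value - running) / count
        return running


big = [1e308, 1e308]
reduced = mean(BasicDuck(big))         # the intermediate sum overflows to inf
overridden = mean(StreamingDuck(big))  # the running update stays finite
```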
mean() is not entirely hypothetical. TensorFlow and Eigen actually do
implement mean separately from sum, though to be honest it's not entirely
clear to me why:
https://github.com/tensorflow/tensorflow/blob/1c1dad105a57bb13711492a8ba5ab9...
https://eigen.tuxfamily.org/dox/unsupported/TensorFunctors_8h_source.html
I do think this probably will come up with some frequency for other
operations, but the bigger answer here really is consistency -- it allows
projects and their users to have very clearly defined dependencies on
NumPy's API. They don't need to worry about any implementation details from
NumPy leaking into their override of a function.
When you say "consistency" here that means: "they can be sure that
when they disagree with the numpy devs about the
semantics/implementation of a numpy API, then the numpy API will
consistently act the way they want, not the way the numpy devs want".
Right?
This is a very double-edged sword :-).
...
...
...
3. We don’t yet have an overloading system for attributes and methods on
array objects, e.g., for accessing .dtype and .shape. This should be the
subject of a future NEP, but until then we should be reluctant to rely on
these properties.
This one I don't understand. If you have a duck-array object, and you
want to access its .dtype or .shape attributes, you just... write
myobj.dtype or myobj.shape? That doesn't need a NEP though so I must
be missing something :-).
We don't have np.asduckarray() yet or whatever we'll end up calling our
proposed casting function from NEP 22, so we don't have a fully fleshed out
mechanism for NumPy to declare "this object needs to support .shape and
.dtype, or I'm going to cast it into something that does".
That's true, but it's just as big a problem for NEP 18, because
__array_function__ is never going to do much if you've already coerced
the thing to an ndarray. Some kind of asduckarray solution is
basically a prerequisite to any other duck array features.
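
A rough sketch of what such a casting function could look like, as a toy model in plain Python. `asduckarray` is only the placeholder name used in this thread, its behaviour here is an assumption, and `ListArray` is a hypothetical stand-in for the fallback array type (ndarray in the real discussion).

```python
# Toy model only. `asduckarray` is the placeholder name used in the thread;
# its behaviour here is an assumption, not a NumPy API.

class ListArray:
    """Hypothetical fallback container standing in for ndarray."""

    def __init__(self, data):
        self.data = list(data)
        self.shape = (len(self.data),)
        self.dtype = type(self.data[0]).__name__ if self.data else "object"


def asduckarray(obj):
    """Pass through objects exposing .shape and .dtype; coerce everything else."""
    if hasattr(obj, "shape") and hasattr(obj, "dtype"):
        return obj
    return ListArray(obj)


class MyDuck:
    """Minimal duck array: declares only the two required attributes."""

    shape = (2,)
    dtype = "float"


duck = MyDuck()
kept = asduckarray(duck)          # already duck-like: returned unchanged
coerced = asduckarray([1, 2, 3])  # plain list: cast to the fallback type
```

Under this assumption, overrides only ever see objects that passed the attribute check, which is the prerequisite described above.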
-n
-- 
Nathaniel J. Smith -- https://vorpus.org
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion