
Hey all,

So I've finally read through NEP 18 (__array_function__). Sorry again for the delay! It's an impressive piece of work! Thanks to the many authors; there's clearly been a lot of thought put into this.

# The trade-off between comprehensive APIs versus clean APIs

At a high level, what makes me nervous about this proposal is that it reminds me of a classic software design pattern that... I don't know a name for. You might call it the "structured monkeypatching" approach to extensibility. The pattern is: a project decides they want to allow some kind of extensions or addons or plugins or something, but defining a big structured API for this is too difficult. So they take their current API surface area and declare that that's the plugin API (plus some mechanism for plugins to hook in, etc.).

Is this pattern good? It's... hard to say. What generally happens is:

1. You get a very complete, powerful, flexible plugin API with minimal work.
2. This quickly leads to a rich ecosystem of powerful plugins, which drives quick uptake of the project, sometimes even driving out competitors.
3. The maintainers slowly realize that committing to such a large and unstructured API is horribly unwieldy and makes changes difficult.
4. The maintainers spend huge amounts of effort trying to crawl out from under the weight of their commitments, with mixed success.

Examples:

pytest, sphinx: For both of these projects, writing plugins is a miserable experience, and you never really know if they'll work with new releases or when composed with random other plugins. Both projects are absolutely the dominant players in their niche, far better than the competition, largely thanks to their rich plugin ecosystems.

CPython: the C extension API is basically just... all of CPython's internals dumped into a header file. Without this numpy wouldn't exist. A key ingredient in Python's miraculous popularity. Also, at this point, possibly the largest millstone preventing further improvements in Python -- this is why we can't have multicore support, JITs, etc. etc.; all the most ambitious discussions at the Python language summit the last few years have circled back to "...but we can't do that b/c it will break the C API". See also: https://mail.python.org/pipermail/python-dev/2018-July/154814.html

Firefox: their original extension API was basically just "our UI is written in javascript, extension modules get to throw more javascript in the pot". One of Firefox's original USPs, and a key part of like... how Mozilla even exists instead of having gone out of business a decade ago. Eventually the extension API started blocking critical architectural changes (e.g. for better sandboxing), and they had to go through an *immensely* painful migration to a properly designed API, which took years and burned huge amounts of goodwill.

So this is like... an extreme version of technical debt. You're making a deal with the devil for wealth and fame, and then eventually the bill comes due. It's hard for me to say categorically that this is a bad idea -- empirically, it can be very successful! But there are real trade-offs. And it makes me a bit nervous that Matt is the one proposing this, because I'm pretty sure if you asked him he'd say he's absolutely focused on how to get something working ASAP and has no plans to maintain numpy in the future.

The other approach would be to incrementally add clean, well-defined dunder methods like __array_ufunc__, __array_concatenate__, etc. (see the sketch just below for the kind of thing I mean).
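To make that concrete, here's a rough sketch of the kind of narrow, well-defined hook I have in mind. __array_ufunc__ is real and already works this way; __array_concatenate__ is hypothetical -- numpy doesn't define it today, and the name and signature here are purely illustrative:

```python
import numpy as np

class MyArray:
    """Toy array container wrapping a plain list -- for illustration only."""

    def __init__(self, data):
        self.data = list(data)

    # Real protocol (NEP 13): numpy already routes ufunc calls here.
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            return NotImplemented
        unwrapped = [x.data if isinstance(x, MyArray) else x for x in inputs]
        return MyArray(ufunc(*unwrapped, **kwargs))

    # Hypothetical protocol: nothing in numpy calls this today. One narrow
    # hook like this could back concatenate, hstack, vstack, column_stack,
    # ... without the container reimplementing each of them separately.
    def __array_concatenate__(self, arrays, axis=0):
        return MyArray(np.concatenate([a.data for a in arrays], axis=axis))


a, b = MyArray([1, 2]), MyArray([3, 4])
print(np.add(a, b).data)   # [4, 6] -- dispatched via __array_ufunc__
```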
This way we end up putting some thought into each interface, making sure that it's something we can support, protecting downstream libraries from unnecessary complexity (e.g. they can implement __array_concatenate__ instead of hstack, vstack, row_stack, column_stack, ...), or avoiding adding new APIs entirely (e.g. by converting existing functions into ufuncs so __array_ufunc__ starts automagically working). And in the end we get a clean list of dunder methods that new array container implementations have to define. It's plausible to imagine a generic test suite for array containers. (I suspect that every library that tries to implement __array_function__ will end up with accidental behavioral differences, just because the numpy API is so vast and contains so many corner cases.)

So the clean-well-defined-dunders approach has lots of upsides. The big downside is that it's a much longer road to go down. I am genuinely uncertain which of these approaches is better on net, or whether we should do both. But because I'm uncertain, I'm nervous about committing to the NEP 18 approach -- it feels risky.

## Can we mitigate that risk?

One thing that helps is the way the proposal makes it all-or-nothing: if you have an __array_function__ method, then you are committing to reimplementing *all* of the numpy API (or at least all the parts that you want to work at all). This is arguably a bad thing in the long run, because only large and well-resourced projects can realistically hope to implement __array_function__. But for now it does somewhat mitigate the risks, because the fewer users we have, the easier it is to work with them to change course later. But that's probably not enough -- "don't worry, if we change it we'll only break large, important projects with lots of users" isn't actually *that* reassuring :-).

The proposal also bills itself as an unstable, provisional experiment ("this protocol should be considered strictly experimental. We reserve the right to change the details of this protocol and how specific NumPy functions use it at any time in the future – even in otherwise bug-fix only releases of NumPy."). This mitigates a lot of risk! If we aren't committing to anything, then sure, why not experiment. But... this is wishful thinking. No matter what the NEP says, I simply don't believe that we'll actually go break dask, sparse arrays, xarray, and sklearn in a numpy point release. Or in any numpy release. Nor should we.

If we're serious about keeping this experimental -- and I think that's an excellent idea for now! -- then IMO we need to do something more to avoid getting trapped by backwards compatibility. My suggestion: at numpy import time, check for an envvar, say NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the __array_function__ dispatches turn into no-ops (a rough sketch of what I mean is below). This lets interested downstream libraries and users try this out, but makes sure that we won't end up with a hundred thousand end users depending on it without realizing it.

Other advantages:

- It makes it easy for end users to check how much overhead this adds, by running their code with it enabled vs. disabled.
- If/when we decide to commit to supporting it for real, we just remove the envvar.
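Concretely, I'm imagining something like the following. (This is a sketch, not actual numpy code; NEP 18 does describe an array_function_dispatch decorator along these lines, and the only new idea here is the environment-variable gate.)

```python
import functools
import os

# Read the opt-in flag once, at numpy import time.
ENABLED = os.environ.get("NUMPY_EXPERIMENTAL_ARRAY_FUNCTION", "0") == "1"

def array_function_dispatch(dispatcher):
    """Decorator applied to each public numpy function (as in NEP 18)."""
    def decorator(implementation):
        if not ENABLED:
            # Envvar not set: no wrapper is installed at all, so the
            # __array_function__ machinery adds zero per-call overhead.
            return implementation

        @functools.wraps(implementation)
        def public_api(*args, **kwargs):
            relevant_args = dispatcher(*args, **kwargs)
            types = frozenset(type(arg) for arg in relevant_args)
            # Simplified dispatch loop; the real protocol has more rules.
            for arg in relevant_args:
                if hasattr(type(arg), "__array_function__"):
                    result = arg.__array_function__(
                        public_api, types, args, kwargs)
                    if result is not NotImplemented:
                        return result
            return implementation(*args, **kwargs)

        return public_api
    return decorator
```

If/when we decide the protocol is here to stay, deleting the `if not ENABLED:` branch (and the envvar) is the whole migration.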
With this change, I'm overall +1 on the proposal. Without it, I... would like more convincing, at least :-).

# Minor quibbles

I don't really understand the 'types' frozenset. The NEP says "it will be used by most __array_function__ methods, which otherwise would need to extract this information themselves" -- but they still need to extract the information themselves, because they still have to examine each object and figure out what type it is. And simply creating a frozenset costs ~0.2 µs on my laptop, which is overhead that we'll never be able to optimize away later.
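For example (a made-up duck array; the details here are my own illustration, not something the NEP specifies):

```python
import numpy as np

class MyDuckArray:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # `types` gives a cheap way to bail out early if some unknown
        # container type is involved...
        if not all(issubclass(t, (MyDuckArray, np.ndarray)) for t in types):
            return NotImplemented
        if func is np.concatenate:
            # ...but to do the actual work we still have to walk the
            # arguments and unwrap/inspect each object ourselves -- i.e.
            # exactly the "extract this information" step the frozenset
            # was supposed to spare us.
            arrays = [a.data if isinstance(a, MyDuckArray) else np.asarray(a)
                      for a in args[0]]  # args[0]: the sequence passed to np.concatenate
            return MyDuckArray(np.concatenate(arrays, **kwargs))
        return NotImplemented

# And the frozenset-creation cost is easy to measure, e.g.:
#   python -m timeit "frozenset((int, float, list))"
```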
-n

On Wed, Aug 1, 2018 at 5:27 PM, Stephan Hoyer <shoyer@gmail.com> wrote:

I propose to accept NEP-18, "A dispatch mechanism for NumPy’s high level array functions": http://www.numpy.org/neps/nep-0018-array-function-protocol.html
Since the last round of discussion, we added a new section on "Callable objects generated at runtime" clarifying that handling such objects is out of scope for the initial proposal in the NEP.
If there are no substantive objections within 7 days from this email, then the NEP will be accepted; see NEP 0 for more details.
Cheers,
Stephan
-- Nathaniel J. Smith -- https://vorpus.org