NEP 31 — Context-local and global overrides of the NumPy API
Hello all,
It was recently brought to my attention that my mails to NumPy-discussion were probably going into the spam folder for many people, so here I am trying from another email. Probably Google trying to force people onto their products as usual. 😉
Me, Ralf Gommers and Peter Bell (both cc’d) have come up with a proposal on how to solve the array creation and duck array problems. The solution is outlined in NEP-31, currently in the form of a PR, [1]
This follows the high-level discussion in NEP 22. [2]
It would be nice to get some feedback.
Full-text of the NEP:
============================================================
NEP 31 — Context-local and global overrides of the NumPy API
============================================================
:Author: Hameer Abbasi
On Mon, Sep 2, 2019 at 2:15 AM Hameer Abbasi wrote:
Me, Ralf Gommers and Peter Bell (both cc’d) have come up with a proposal on how to solve the array creation and duck array problems. The solution is outlined in NEP-31, currently in the form of a PR, [1]
Thanks for putting this together! It'd be great to have more engagement between uarray and numpy.
============================================================
NEP 31 — Context-local and global overrides of the NumPy API
============================================================
Now that I've read this over, my main feedback is that right now it seems too vague and high-level to give it a fair evaluation? The idea of a NEP is to lay out a problem and proposed solution in enough detail that it can be evaluated and critiqued, but this felt to me more like it was pointing at some other documents for all the details and then promising that uarray has solutions for all our problems.
This NEP takes a more holistic approach: It assumes that there are parts of the API that need to be overridable, and that these will grow over time. It provides a general framework and a mechanism to avoid a design of a new protocol each time this is required.
The idea of a holistic approach makes me nervous, because I'm not sure we have holistic problems. Sometimes a holistic approach is the right thing; other times it means sweeping the actual problems under the rug, so things *look* simple and clean but in fact nothing has been solved, and they just end up biting us later. And from the NEP as currently written, I can't tell whether this is the good kind of holistic or the bad kind of holistic.

Now I'm writing vague handwavey things, so let me follow my own advice and make it more concrete with an example :-).

When Stephan and I were writing NEP 22, the single thing we spent the most time discussing was the problem of duck-array coercion, and in particular what to do about existing code that does np.asarray(duck_array_obj). The reason this is challenging is that there's a lot of code written in Cython/C/C++ that calls np.asarray, and then blindly casts the return value to a PyArray struct and starts accessing the raw memory fields. If np.asarray starts returning anything besides a real-actual np.ndarray object, then this code will start corrupting random memory, leading to a segfault at best.

Stephan felt strongly that this meant that existing np.asarray calls *must not* ever return anything besides an np.ndarray object, and therefore we needed to add a new function np.asduckarray(), or maybe an explicit opt-in flag like np.asarray(..., allow_duck_array=True).

I agreed that this was a problem, but thought we might be able to get away with an "opt-out" system, where we add an allow_duck_array= flag, but make it *default* to True, and document that the Cython/C/C++ users who want to work with a raw np.ndarray object should modify their code to explicitly call np.asarray(obj, allow_duck_array=False). This would mean that for a while people who tried to pass duck-arrays into legacy libraries would get segfaults, but there would be a clear path for fixing these issues as they were discovered.
Either way, there are also some other details to figure out: how does this affect the C version of asarray? What about np.asfortranarray – probably that should default to allow_duck_array=False, even if we did make np.asarray default to allow_duck_array=True, right?

Now if I understand right, your proposal would be to make it so any code in any package could arbitrarily change the behavior of np.asarray for all inputs, e.g. I could just decide that np.asarray([1, 2, 3]) should return some arbitrary non-np.ndarray object. It seems like this has a much greater potential for breaking existing Cython/C/C++ code, and the NEP doesn't currently describe why this extra power is useful, and it doesn't currently describe how it plans to mitigate the downsides. (For example, if a caller needs a real np.ndarray, then is there some way to explicitly request one? The NEP doesn't say.) Maybe this is all fine and there are solutions to these issues, but any proposal to address duck array coercion needs to at least talk about these issues!

And that's just one example... array coercion is a particularly central and tricky problem, but the numpy API is big, and there are probably other problems like this. For another example, I don't understand what the NEP is proposing to do about dtypes at all.

That's why I think the NEP needs to be fleshed out a lot more before it will be possible to evaluate fairly.

-n

-- Nathaniel J. Smith -- https://vorpus.org
On Mon, Sep 2, 2019 at 2:09 PM Nathaniel Smith wrote:
On Mon, Sep 2, 2019 at 2:15 AM Hameer Abbasi wrote:

Me, Ralf Gommers and Peter Bell (both cc’d) have come up with a proposal on how to solve the array creation and duck array problems. The solution is outlined in NEP-31, currently in the form of a PR, [1]
Thanks for putting this together! It'd be great to have more engagement between uarray and numpy.
============================================================
NEP 31 — Context-local and global overrides of the NumPy API
============================================================
Now that I've read this over, my main feedback is that right now it seems too vague and high-level to give it a fair evaluation? The idea of a NEP is to lay out a problem and proposed solution in enough detail that it can be evaluated and critiqued, but this felt to me more like it was pointing at some other documents for all the details and then promising that uarray has solutions for all our problems.
This is fair enough I think. We'll need to put some more thought in where to refer to other NEPs, and where to be more concrete.
This NEP takes a more holistic approach: It assumes that there are parts of the API that need to be overridable, and that these will grow over time. It provides a general framework and a mechanism to avoid a design of a new protocol each time this is required.
The idea of a holistic approach makes me nervous, because I'm not sure we have holistic problems. Sometimes a holistic approach is the right thing; other times it means sweeping the actual problems under the rug, so things *look* simple and clean but in fact nothing has been solved, and they just end up biting us later. And from the NEP as currently written, I can't tell whether this is the good kind of holistic or the bad kind of holistic.
Now I'm writing vague handwavey things, so let me follow my own advice and make it more concrete with an example :-).
When Stephan and I were writing NEP 22, the single thing we spent the most time discussing was the problem of duck-array coercion, and in particular what to do about existing code that does np.asarray(duck_array_obj).
The reason this is challenging is that there's a lot of code written in Cython/C/C++ that calls np.asarray,
Cython code only perhaps? It would surprise me if there's a lot of C/C++ code that explicitly calls into our Python rather than C API.

and then blindly casts the return value to a PyArray struct and starts accessing the raw memory fields. If np.asarray starts returning anything besides a real-actual np.ndarray object, then this code will start corrupting random memory, leading to a segfault at best.
Stephan felt strongly that this meant that existing np.asarray calls *must not* ever return anything besides an np.ndarray object, and therefore we needed to add a new function np.asduckarray(), or maybe an explicit opt-in flag like np.asarray(..., allow_duck_array=True).
I agreed that this was a problem, but thought we might be able to get away with an "opt-out" system, where we add an allow_duck_array= flag, but make it *default* to True, and document that the Cython/C/C++ users who want to work with a raw np.ndarray object should modify their code to explicitly call np.asarray(obj, allow_duck_array=False). This would mean that for a while people who tried to pass duck-arrays into legacy libraries would get segfaults, but there would be a clear path for fixing these issues as they were discovered.
Either way, there are also some other details to figure out: how does this affect the C version of asarray? What about np.asfortranarray – probably that should default to allow_duck_array=False, even if we did make np.asarray default to allow_duck_array=True, right?
Now if I understand right, your proposal would be to make it so any code in any package could arbitrarily change the behavior of np.asarray for all inputs, e.g. I could just decide that np.asarray([1, 2, 3]) should return some arbitrary non-np.ndarray object.
No, definitely not! It's all opt-in, by explicitly importing from `numpy.overridable` or `unumpy`. No behavior of anything in the existing numpy namespaces should be affected in any way. I agree with the concerns below, hence it should stay opt-in.

Cheers,
Ralf

It seems like this has a much greater potential for breaking existing Cython/C/C++ code, and the NEP doesn't currently describe why this extra power is useful, and it doesn't currently describe how it plans to mitigate the downsides. (For example, if a caller needs a real np.ndarray, then is there some way to explicitly request one? The NEP doesn't say.) Maybe this is all fine and there are solutions to these issues, but any proposal to address duck array coercion needs to at least talk about these issues!
And that's just one example... array coercion is a particularly central and tricky problem, but the numpy API is big, and there are probably other problems like this. For another example, I don't understand what the NEP is proposing to do about dtypes at all.
That's why I think the NEP needs to be fleshed out a lot more before it will be possible to evaluate fairly.
-n
-- Nathaniel J. Smith -- https://vorpus.org

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
On Mon, Sep 2, 2019 at 11:21 PM Ralf Gommers wrote:
On Mon, Sep 2, 2019 at 2:09 PM Nathaniel Smith wrote:

The reason this is challenging is that there's a lot of code written in Cython/C/C++ that calls np.asarray,
Cython code only perhaps? It would surprise me if there's a lot of C/C++ code that explicitly calls into our Python rather than C API.
I think there's also code written as Python-wrappers-around-C-code where the Python layer handles the error-checking/coercion, and the C code trusts it to have done so.
Now if I understand right, your proposal would be to make it so any code in any package could arbitrarily change the behavior of np.asarray for all inputs, e.g. I could just decide that np.asarray([1, 2, 3]) should return some arbitrary non-np.ndarray object.
No, definitely not! It's all opt-in, by explicitly importing from `numpy.overridable` or `unumpy`. No behavior of anything in the existing numpy namespaces should be affected in any way.
Ah, whoops, I definitely missed that :-). That does change things!
So one of the major decision points for any duck-array API work, is
whether to modify the numpy semantics "in place", so user code
automatically gets access to the new semantics, or else to make a new
namespace, that users have to switch over to manually.
The major disadvantage of doing changes "in place" is, of course, that
we have to do all this careful work to move incrementally and make
sure that we don't break things. The major (potential) advantage is
that we have a much better chance of moving the ecosystem with us.
The major advantage of making a new namespace is that it's *much*
easier to experiment, because there's no chance of breaking any
projects that didn't opt in. The major disadvantage is that numpy is
super strongly entrenched, and convincing every project to switch to
something else is incredibly difficult and costly. (I just searched
github for "import numpy" and got 17.7 million hits. That's a lot of
imports to update!) Also, empirically, we've seen multiple projects
try to do this (e.g. DyND), and so far they all failed.
It sounds like unumpy is an interesting approach that hasn't been
tried before – in particular, the promise that you can "just switch
your imports" is a much easier transition than e.g. DyND offered. Of
course, that promise is somewhat undermined by the reality that all
these potential backend libraries *aren't* 100% compatible with numpy,
and can't be... it might turn out that this ends up like asanyarray,
where you can't really use it reliably because the thing that comes
out will generally support *most* of the normal ndarray semantics, but
you don't know which part. Is scipy planning to switch to using this
everywhere, including in C code? If not, then how do you expect
projects like matplotlib to switch, given that matplotlib likes to
pass array objects into scipy functions? Are you planning to take the
opportunity to clean up some of the obscure corners of the numpy API?
But those are general questions about unumpy, and I'm guessing no-one
knows all the answers yet... and these questions actually aren't super
relevant to the NEP. The NEP isn't inventing unumpy. IIUC, the main
thing the NEP proposes is simply to make "numpy.overridable" an
alias for "unumpy".
It's not clear to me what problem this alias is solving. If all
downstream users have to update their imports anyway, then they can
write "import unumpy as np" just as easily as they can write "import
numpy.overridable as np". I guess the main reason this is a NEP is
because the unumpy project is hoping to get an "official stamp of
approval" from numpy? But even that could be accomplished by just
putting something in the docs. And adding the alias has substantial
risks: it makes unumpy tied to the numpy release cycle and
compatibility rules, and it means that we're committing to maintaining
unumpy ~forever even if Hameer or Quansight move onto other things.
That seems like a lot to take on for such vague benefits?
On Tue, Sep 3, 2019 at 2:04 AM Hameer Abbasi wrote:
The fact that we're having to design more and more protocols for a lot of very similar things is, to me, an indicator that we do have holistic problems that ought to be solved by a single protocol.
But the reason we've had trouble designing these protocols is that they're each different :-). If it was just a matter of copying __array_ufunc__ we'd have been done in a few minutes... -n -- Nathaniel J. Smith -- https://vorpus.org
That's a lot of very good questions! Let me see if I can answer them one-by-one.

On 06.09.19 09:49, Nathaniel Smith wrote:

Ah, whoops, I definitely missed that :-). That does change things! So one of the major decision points for any duck-array API work, is whether to modify the numpy semantics "in place", so user code automatically gets access to the new semantics, or else to make a new namespace, that users have to switch over to manually.

The major disadvantage of doing changes "in place" is, of course, that we have to do all this careful work to move incrementally and make sure that we don't break things. The major (potential) advantage is that we have a much better chance of moving the ecosystem with us.

The major advantage of making a new namespace is that it's *much* easier to experiment, because there's no chance of breaking any projects that didn't opt in. The major disadvantage is that numpy is super strongly entrenched, and convincing every project to switch to something else is incredibly difficult and costly. (I just searched github for "import numpy" and got 17.7 million hits. That's a lot of imports to update!) Also, empirically, we've seen multiple projects try to do this (e.g. DyND), and so far they all failed.

It sounds like unumpy is an interesting approach that hasn't been tried before – in particular, the promise that you can "just switch your imports" is a much easier transition than e.g. DyND offered. Of course, that promise is somewhat undermined by the reality that all these potential backend libraries *aren't* 100% compatible with numpy, and can't be...

This is true, however, with minor adjustments it should be possible to make your code work across backends, if you don't use a few obscure parts of NumPy.

it might turn out that this ends up like asanyarray, where you can't really use it reliably because the thing that comes out will generally support *most* of the normal ndarray semantics, but you don't know which part. Is scipy planning to switch to using this everywhere, including in C code?

Not at present I think, however, it should be possible to "re-write" parts of scipy on top of unumpy in order to make that work, and where speed is required and an efficient implementation isn't available in terms of NumPy functions, make dispatchable multimethods and allow library authors to provide the said implementations. We'll call this project uscipy, but that's an endgame at this point. Right now, we're focusing on unumpy.
If not, then how do you expect projects like matplotlib to switch, given that matplotlib likes to pass array objects into scipy functions? Are you planning to take the opportunity to clean up some of the obscure corners of the numpy API?
That's a completely different thing, and answering that question requires a distinction between uarray and unumpy... uarray is a backend mechanism, independent of array computing. We hope that matplotlib will adopt it to switch around its GUI backends, for example.
But those are general questions about unumpy, and I'm guessing no-one knows all the answers yet... and these questions actually aren't super relevant to the NEP. The NEP isn't inventing unumpy. IIUC, the main thing the NEP proposes is simply to make "numpy.overridable" an alias for "unumpy".
It's not clear to me what problem this alias is solving. If all downstream users have to update their imports anyway, then they can write "import unumpy as np" just as easily as they can write "import numpy.overridable as np". I guess the main reason this is a NEP is because the unumpy project is hoping to get an "official stamp of approval" from numpy?
That's part of it. The concrete problems it's solving are threefold:

1. Array creation functions can be overridden.
2. Array coercion is now covered.
3. "Default implementations" will allow you to re-write your NumPy array more easily, when such efficient implementations exist in terms of other NumPy functions. That will also help achieve similar semantics, but as I said, they're just "default"...

The import numpy.overridable part is meant to help garner adoption, and to prefer the unumpy module if it is available (which will continue to be developed separately). That way it isn't so tightly coupled to the release cycle. One alternative Sebastian Berg mentioned (and I am on board with) is just moving unumpy into the NumPy organisation. What we fear in keeping it separate is that the simple act of a pip install unumpy will keep people from using it or trying it out.
But even that could be accomplished by just putting something in the docs. And adding the alias has substantial risks: it makes unumpy tied to the numpy release cycle and compatibility rules, and it means that we're committing to maintaining unumpy ~forever even if Hameer or Quansight move onto other things. That seems like a lot to take on for such vague benefits?
I can assure you Travis has had the goal of "replatforming SciPy" from as far back as I met him, he's spawned quite a few efforts in that direction along with others from Quansight (and they've led to nice projects). Quansight, as I see it, is unlikely to abandon something like this if it becomes successful (and acceptance of this NEP will be a huge success story).
On Tue, Sep 3, 2019 at 2:04 AM Hameer Abbasi wrote:

The fact that we're having to design more and more protocols for a lot of very similar things is, to me, an indicator that we do have holistic problems that ought to be solved by a single protocol.

But the reason we've had trouble designing these protocols is that they're each different :-). If it was just a matter of copying __array_ufunc__ we'd have been done in a few minutes...

uarray borrows heavily from __array_function__. It allows substituting (for example) __array_ufunc__ by overriding ufunc.__call__, ufunc.reduce and so on. It takes, as I mentioned, a holistic approach: there are callables that need to be overridden, possibly with nothing to dispatch on. And then it builds on top of that, adding coercion/conversion.
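For readers unfamiliar with the mechanism being discussed, here is a toy model of uarray-style context-local backend dispatch. The names and semantics are illustrative only, not the real uarray API; the point is that a *creation* function can be overridden even though there is no array argument to dispatch on.

```python
# Toy model of uarray-style context-local backend dispatch; names and
# semantics are illustrative, not the real uarray API.
import contextvars
from contextlib import contextmanager

_backend = contextvars.ContextVar("backend", default=None)

@contextmanager
def set_backend(backend):
    """Context-local override, as in the NEP title."""
    token = _backend.set(backend)
    try:
        yield
    finally:
        _backend.reset(token)

def multimethod(default):
    """Make `default` overridable by whatever backend is active."""
    def wrapper(*args, **kwargs):
        backend = _backend.get()
        if backend is not None and default.__name__ in backend:
            return backend[default.__name__](*args, **kwargs)
        return default(*args, **kwargs)
    wrapper.__name__ = default.__name__
    return wrapper

@multimethod
def ones(n):
    return [1.0] * n                 # "NumPy" reference implementation

# An array-creation function is overridden -- something a type-based
# protocol cannot do, since there is no array input to dispatch on.
with set_backend({"ones": lambda n: ("sparse-ones", n)}):
    assert ones(3) == ("sparse-ones", 3)
assert ones(3) == [1.0, 1.0, 1.0]    # override is context-local
```

This is what "there are callables that need to be overridden, possibly with nothing to dispatch on" means in practice: the active backend, not the argument types, decides which implementation runs.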
On Fri, Sep 6, 2019 at 1:32 AM Hameer Abbasi wrote:
That's a lot of very good questions! Let me see if I can answer them one-by-one.
On 06.09.19 09:49, Nathaniel Smith wrote:
But even that could be accomplished by just putting something in the docs. And adding the alias has substantial risks: it makes unumpy tied to the numpy release cycle and compatibility rules, and it means that we're committing to maintaining unumpy ~forever even if Hameer or Quansight move onto other things. That seems like a lot to take on for such vague benefits?
I can assure you Travis has had the goal of "replatforming SciPy" from as far back as I met him, he's spawned quite a few efforts in that direction along with others from Quansight (and they've led to nice projects). Quansight, as I see it, is unlikely to abandon something like this if it becomes successful (and acceptance of this NEP will be a huge success story).
Let me address this separately, since it's not really a technical concern.

First, this is not what we say for other contributions. E.g. we didn't say no to Pocketfft because Martin Reineck may move on, or __array_function__ because Stephan may get other interests at some point, or a whole new numpy.random, etc.

Second, this is not about Quansight. At Quansight Labs we've been able to create time for Hameer to build this, and me and others to contribute - which is very nice, but the two are not tied inextricably together. In the end it's still individuals submitting this NEP. I have been a NumPy dev for ~10 years before joining Quansight, and my future NumPy contributions are not dependent on staying at Quansight (not that I plan to go anywhere!). I'm guessing the same is true for others.

Third, unumpy is a fairly thin layer over uarray, which already has another user in SciPy.

Cheers,
Ralf
On Fri, Sep 6, 2019 at 11:44 AM Ralf Gommers wrote:
On Fri, Sep 6, 2019 at 1:32 AM Hameer Abbasi wrote:

That's a lot of very good questions! Let me see if I can answer them one-by-one.
On 06.09.19 09:49, Nathaniel Smith wrote:
But even that could be accomplished by just putting something in the docs. And adding the alias has substantial risks: it makes unumpy tied to the numpy release cycle and compatibility rules, and it means that we're committing to maintaining unumpy ~forever even if Hameer or Quansight move onto other things. That seems like a lot to take on for such vague benefits?
I can assure you Travis has had the goal of "replatforming SciPy" from as far back as I met him, he's spawned quite a few efforts in that direction along with others from Quansight (and they've led to nice projects). Quansight, as I see it, is unlikely to abandon something like this if it becomes successful (and acceptance of this NEP will be a huge success story).
Let me address this separately, since it's not really a technical concern.
First, this is not what we say for other contributions. E.g. we didn't say no to Pocketfft because Martin Reineck may move on, or __array_function__ because Stephan may get other interests at some point, or a whole new numpy.random, etc.
Second, this is not about Quansight. At Quansight Labs we've been able to create time for Hameer to build this, and me and others to contribute - which is very nice, but the two are not tied inextricably together. In the end it's still individuals submitting this NEP. I have been a NumPy dev for ~10 years before joining Quansight, and my future NumPy contributions are not dependent on staying at Quansight (not that I plan to go anywhere!). I'm guessing the same is true for others.
Third, unumpy is a fairly thin layer over uarray, which already has another user in SciPy.
I'm sorry if that came across as some kind of snipe at Quansight specifically. I didn't mean it that way. It's a much more general concern: software projects are inherently risky, and often fail; companies and research labs change focus and funding shifts around. This is just a general risk that we need to take into account when making decisions. And when there are proposals to add new submodules to numpy, we always put them under intense scrutiny, exactly because of the support commitments.

The new fft and random code are replacing/extending our existing public APIs that we already committed to, so that's a very different situation. And __array_function__ was something that couldn't work at all without being built into numpy, and even then it was controversial and merged on an experimental basis. It's always about trade-offs. My concern here is that the NEP is proposing that the numpy maintainers take on this large commitment, *and* AFAICT there's no compensating benefit to justify that: everything that can be done with numpy.overridable can be done just as well with a standalone unumpy package... right?

-n

-- Nathaniel J. Smith -- https://vorpus.org
On Fri, Sep 6, 2019 at 5:16 PM Nathaniel Smith wrote:

On Fri, Sep 6, 2019 at 11:44 AM Ralf Gommers wrote:

On Fri, Sep 6, 2019 at 1:32 AM Hameer Abbasi wrote:

That's a lot of very good questions! Let me see if I can answer them one-by-one.
On 06.09.19 09:49, Nathaniel Smith wrote:
But even that could be accomplished by just putting something in the docs. And adding the alias has substantial risks: it makes unumpy tied to the numpy release cycle and compatibility rules, and it means that we're committing to maintaining unumpy ~forever even if Hameer or Quansight move onto other things. That seems like a lot to take on for such vague benefits?
I can assure you Travis has had the goal of "replatforming SciPy" from as far back as I met him, he's spawned quite a few efforts in that direction along with others from Quansight (and they've led to nice projects). Quansight, as I see it, is unlikely to abandon something like this if it becomes successful (and acceptance of this NEP will be a huge success story).
Let me address this separately, since it's not really a technical concern.

First, this is not what we say for other contributions. E.g. we didn't say no to Pocketfft because Martin Reineck may move on, or __array_function__ because Stephan may get other interests at some point, or a whole new numpy.random, etc.

Second, this is not about Quansight. At Quansight Labs we've been able to create time for Hameer to build this, and me and others to contribute - which is very nice, but the two are not tied inextricably together. In the end it's still individuals submitting this NEP. I have been a NumPy dev for ~10 years before joining Quansight, and my future NumPy contributions are not dependent on staying at Quansight (not that I plan to go anywhere!). I'm guessing the same is true for others.

Third, unumpy is a fairly thin layer over uarray, which already has another user in SciPy.
I'm sorry if that came across as some kind of snipe at Quansight specifically. I didn't mean it that way. It's a much more general concern: software projects are inherently risky, and often fail; companies and research labs change focus and funding shifts around. This is just a general risk that we need to take into account when making decisions. And when there are proposals to add new submodules to numpy, we always put them under intense scrutiny, exactly because of the support commitments.
Yes, that's fair, and we should be critical here. All code we accept is indeed a maintenance burden.
The new fft and random code are replacing/extending our existing public APIs that we already committed to, so that's a very different situation. And __array_function__ was something that couldn't work at all without being built into numpy, and even then it was controversial and merged on an experimental basis. It's always about trade-offs. My concern here is that the NEP is proposing that the numpy maintainers take on this large commitment,
Again, not just the NumPy maintainers. There really isn't that much in `unumpy` that's all that complicated. And again, `uarray` has multiple maintainers (note that Peter is also a SciPy core dev) and has another user in SciPy.

*and* AFAICT there's no compensating benefit to justify that: everything that can be done with numpy.overridable can be done just as well with a standalone unumpy package... right?
True, mostly. But at that point, if we say that it's the way to do array coercion and creation (and perhaps some other things as well), we're saying at the same time that every other package that needs this (e.g. Dask, CuPy) should take unumpy as a hard dependency. Which is a much bigger ask than when it comes with NumPy. We can discuss it of course. The major exception is if we want to make it default for some functionality, like for example numpy.fft (I'll answer your other email for that).

Cheers,
Ralf
That's a lot of very good questions! Let me see if I can answer them one-by-one.
On 06.09.19 09:49, Nathaniel Smith wrote:
But those are general questions about unumpy, and I'm guessing no-one knows all the answers yet... and these questions actually aren't super relevant to the NEP. The NEP isn't inventing unumpy. IIUC, the main thing the NEP proposes is simply to make "numpy.overridable" an alias for "unumpy".
It's not clear to me what problem this alias is solving. If all downstream users have to update their imports anyway, then they can write "import unumpy as np" just as easily as they can write "import numpy.overridable as np". I guess the main reason this is a NEP is because the unumpy project is hoping to get an "official stamp of approval" from numpy?
Also because we have NEP 30 for yet another protocol, and there's likely another NEP to follow after that for array creation. Those use cases are covered by unumpy, so it makes sense to have a NEP for that as well, so
On Fri, Sep 6, 2019 at 1:32 AM Hameer Abbasi wrote:
That's part of it. The concrete problems it's solving are threefold:
1. Array creation functions can be overridden. 2. Array coercion is now covered. 3. "Default implementations" will allow you to re-write your NumPy array more easily, when such efficient implementations exist in terms of other NumPy functions. That will also help achieve similar semantics, but as I said, they're just "default"...
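To unpack point 3 for readers: a "default implementation" is one written in terms of other overridable functions, so a backend gets it for free. The sketch below uses illustrative names (`current_backend`, `full`, `ones`), not the real unumpy/uarray API.

```python
# Toy model of unumpy-style "default implementations". Names are
# illustrative, not the real unumpy/uarray API.

current_backend = {}  # maps function name -> backend implementation

def full(n, fill_value):
    impl = current_backend.get("full")
    if impl is not None:
        return impl(n, fill_value)
    return [fill_value] * n          # reference (NumPy-like) implementation

def ones(n):
    impl = current_backend.get("ones")
    if impl is not None:
        return impl(n)
    # Default implementation: expressed in terms of another overridable
    # function, so a backend that only provides `full` gets `ones` free.
    return full(n, 1.0)

assert ones(2) == [1.0, 1.0]             # no backend: reference path
current_backend["full"] = lambda n, v: ("mybackend", n, v)
assert ones(2) == ("mybackend", 2, 1.0)  # ones dispatches via full
```

A backend can still register its own faster `ones`; the default is only a fallback that guarantees consistent semantics.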
There may be another very concrete one (that's not yet in the NEP): allowing other libraries that consume ndarrays to use overrides. An example is numpy.fft: currently both mkl_fft and pyfftw monkeypatch NumPy, something we don't like all that much (in particular for mkl_fft, because it's the default in Anaconda). `__array_function__` isn't able to help here, because it will always choose NumPy's own implementation for ndarray input. With unumpy you can support multiple libraries that consume ndarrays.
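The limitation Ralf describes can be shown with a toy model: a protocol that dispatches on the *type* of the input (like `__array_function__`) always routes plain-ndarray inputs to NumPy's own implementation, so an alternative fft provider never gets a chance. Class names below are stand-ins, not real APIs.

```python
# Toy illustration: type-based dispatch cannot serve the numpy.fft
# use case, because ndarray input always selects NumPy's own fft.

class Dense:
    """Stand-in for np.ndarray: defines no override protocol."""

class DaskLike:
    """Stand-in for a duck array that implements the protocol."""
    def __array_function__(self, func, args):
        return ("dask", func.__name__)

def fft(x):
    # Type-based dispatch: only non-default array types get a say.
    if hasattr(x, "__array_function__"):
        return x.__array_function__(fft, (x,))
    return ("numpy", "fft")   # ndarray input: always NumPy's fft

assert fft(DaskLike()) == ("dask", "fft")
assert fft(Dense()) == ("numpy", "fft")  # no hook for mkl_fft/pyfftw here
```

A backend-based mechanism instead keys the choice on the active backend, which is why mkl_fft or pyfftw could plug in without monkeypatching.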
Another example is einsum: if you want to use opt_einsum for all inputs
(including ndarrays), then you cannot use np.einsum. And yet another is
using bottleneck (https://kwgoodman.github.io/bottleneck-doc/reference.html)
for nan-functions and partition. There's likely more of these.
The point is: sometimes the array protocols are preferred (e.g.
Dask/Xarray-style meta-arrays), sometimes unumpy-style dispatch works
better. It's also not necessarily an either or, they can be complementary.
Actually, after writing this I just realized something. With 1.17.x we have:
```
In [1]: import dask.array as da
In [2]: d = da.from_array(np.linspace(0, 1))
In [3]: np.fft.fft(d)
Out[3]: dask.array
```
In Anaconda `np.fft.fft` *is* `mkl_fft._numpy_fft.fft`, so this won't work. We have no bug report yet because 1.17.x hasn't landed in conda defaults yet (perhaps this is a/the reason why?), but it will be a problem.
The import numpy.overridable part is meant to help garner adoption, and to prefer the unumpy module if it is available (which will continue to be developed separately). That way it isn't so tightly coupled to the release cycle. One alternative Sebastian Berg mentioned (and I am on board with) is just moving unumpy into the NumPy organisation. What we fear keeping it separate is that the simple act of a pip install unumpy will keep people from using it or trying it out.
Note that this is not the most critical aspect. I pushed for vendoring as numpy.overridable because I want to not derail the comparison with NEP 30 et al. with a "should we add a dependency" discussion. The interesting part to decide on first is: do we need the unumpy override mechanism? Vendoring opt-in vs. making it default vs. adding a dependency is of secondary interest right now. Cheers, Ralf
On Fri, Sep 6, 2019 at 2:45 PM Ralf Gommers
There may be another very concrete one (that's not yet in the NEP): allowing other libraries that consume ndarrays to use overrides. An example is numpy.fft: currently both mkl_fft and pyfftw monkeypatch NumPy, something we don't like all that much (in particular for mkl_fft, because it's the default in Anaconda). `__array_function__` isn't able to help here, because it will always choose NumPy's own implementation for ndarray input. With unumpy you can support multiple libraries that consume ndarrays.
unumpy doesn't help with this either though, does it? unumpy is double-opt-in: the code using np.fft has to switch to using unumpy.fft instead, and then someone has to enable the backend. But MKL/pyfftw started out as opt-in – you could `import mkl_fft` or `import pyfftw` – and the whole reason they switched to monkeypatching is that they decided that opt-in wasn't good enough for them.
The import numpy.overridable part is meant to help garner adoption, and to prefer the unumpy module if it is available (which will continue to be developed separately). That way it isn't so tightly coupled to the release cycle. One alternative Sebastian Berg mentioned (and I am on board with) is just moving unumpy into the NumPy organisation. What we fear keeping it separate is that the simple act of a pip install unumpy will keep people from using it or trying it out.
Note that this is not the most critical aspect. I pushed for vendoring as numpy.overridable because I want to not derail the comparison with NEP 30 et al. with a "should we add a dependency" discussion. The interesting part to decide on first is: do we need the unumpy override mechanism? Vendoring opt-in vs. making it default vs. adding a dependency is of secondary interest right now.
Wait, but I thought the only reason we would have a dependency is if we're exporting it as part of the numpy namespace. If we keep the import as `import unumpy`, then it works just as well, without any dependency *or* vendoring in numpy, right? -n -- Nathaniel J. Smith -- https://vorpus.org
On Fri, Sep 6, 2019 at 4:51 PM Nathaniel Smith
On Fri, Sep 6, 2019 at 2:45 PM Ralf Gommers
wrote:
There may be another very concrete one (that's not yet in the NEP): allowing other libraries that consume ndarrays to use overrides. An example is numpy.fft: currently both mkl_fft and pyfftw monkeypatch NumPy, something we don't like all that much (in particular for mkl_fft, because it's the default in Anaconda). `__array_function__` isn't able to help here, because it will always choose NumPy's own implementation for ndarray input. With unumpy you can support multiple libraries that consume ndarrays.
unumpy doesn't help with this either though, does it? unumpy is double-opt-in: the code using np.fft has to switch to using unumpy.fft instead, and then someone has to enable the backend.
Very good point. It would make a lot of sense to at least make unumpy default on fft/linalg/random, even if we want to keep it opt-in for the functions in the main namespace.
But MKL/pyfftw started out as opt-in – you could `import mkl_fft` or `import pyfftw` – and the whole reason they switched to monkeypatching is that they decided that opt-in wasn't good enough for them.
No, that's not correct. The MKL team has asked for a proper backend system, so they can plug into numpy rather than monkeypatch it. Oleksey, Chuck and I discussed that two years ago already at the NumFOCUS Summit 2017. This has been explicitly on the NumPy roadmap for quite a while: "A backend system for numpy.fft (so that e.g. fft-mkl doesn’t need to monkeypatch numpy)" (see https://numpy.org/neps/roadmap.html#other-functionality).
And if Anaconda would like to default to it, that's possible - because one registered backend needs to be chosen as the default, that could be mkl-fft. That is still a major improvement over the situation today.
The import numpy.overridable part is meant to help garner adoption, and to prefer the unumpy module if it is available (which will continue to be developed separately). That way it isn't so tightly coupled to the release cycle. One alternative Sebastian Berg mentioned (and I am on board with) is just moving unumpy into the NumPy organisation. What we fear keeping it separate is that the simple act of a pip install unumpy will keep people from using it or trying it out.
Note that this is not the most critical aspect. I pushed for vendoring as numpy.overridable because I want to not derail the comparison with NEP 30 et al. with a "should we add a dependency" discussion. The interesting part to decide on first is: do we need the unumpy override mechanism? Vendoring opt-in vs. making it default vs. adding a dependency is of secondary interest right now.
Wait, but I thought the only reason we would have a dependency is if we're exporting it as part of the numpy namespace. If we keep the import as `import unumpy`, then it works just as well, without any dependency *or* vendoring in numpy, right?
Vendoring means "include the code". So no dependency on an external package. If we don't vendor, it's going to be either unused, or end up as a dependency for the whole SciPy/PyData stack. Actually, now that we've discussed the fft issue, I'd suggest to change the NEP to: vendor, and make default for fft, random, and linalg. Cheers, Ralf
On Fri, Sep 6, 2019 at 11:04 PM Ralf Gommers
Vendoring means "include the code". So no dependency on an external package. If we don't vendor, it's going to be either unused, or end up as a dependency for the whole SciPy/PyData stack.
If we vendor it then it also ends up as a dependency for the whole SciPy/PyData stack...
Actually, now that we've discussed the fft issue, I'd suggest to change the NEP to: vendor, and make default for fft, random, and linalg.
There's no way we can have an effective discussion of duck arrays, fft backends, random backends, and linalg backends all at once in a single thread. Can you write separate NEPs for each of these? Some questions I'd like to see addressed:

For fft:
- fft is an entirely self-contained operation, with no interactions with the rest of the system; the only difference between implementations is speed. What problems are caused by monkeypatching, and how is uarray materially different from monkeypatching?

For random:
- I thought the new random implementation with pluggable generators etc. was supposed to solve this problem already. Why doesn't it?
- The biggest issue with MKL monkeypatching random is that it breaks stream stability. How does the uarray approach address this?

For linalg:
- linalg already supports __array_ufunc__ for overrides. Why do we need a second override system? Isn't that redundant?

-n

-- Nathaniel J. Smith -- https://vorpus.org
On Sat, Sep 7, 2019 at 4:16 PM Nathaniel Smith
On Fri, Sep 6, 2019 at 11:04 PM Ralf Gommers wrote:
Vendoring means "include the code". So no dependency on an external package. If we don't vendor, it's going to be either unused, or end up as a dependency for the whole SciPy/PyData stack.
If we vendor it then it also ends up as a dependency for the whole SciPy/PyData stack...
It seems you're just using an unusual definition here. Dependency == a package you have to install, is present in pyproject.toml/install_requires, shows up in https://github.com/numpy/numpy/network/dependencies, etc.
Actually, now that we've discussed the fft issue, I'd suggest to change the NEP to: vendor, and make default for fft, random, and linalg.
There's no way we can have an effective discussion of duck arrays, fft backends, random backends, and linalg backends all at once in a single thread.
Can you write separate NEPs for each of these? Some questions I'd like to see addressed:
For fft:
- fft is an entirely self-contained operation, with no interactions with the rest of the system; the only difference between implementations is speed. What problems are caused by monkeypatching, and how is uarray materially different from monkeypatching?

It was already explained in this thread, it's been on our roadmap for ~2 years at least, and monkeypatching is pretty much universally understood to be bad. If that's not enough, please search the NumPy issues for "monkeypatching". You'll find issues like https://github.com/numpy/numpy/issues/12374#issuecomment-438725645. At the moment this is very confusing, and hard to diagnose - you have to install a whole new NumPy and then find that the problem is gone (or not). Being able to switch backends in one line of code and re-test would be very valuable.

It seems perhaps more useful to have a call so we can communicate with higher bandwidth, rather than lots of writing new NEPs here? In preparation, we need to write up in more detail how __array_function__ and unumpy fit together, rather than treat different pieces all separately (because the problems and pros/cons really won't change much between functions and submodules). I'll defer answering your other questions till that's done, so the discussion is hopefully a bit more structured.

Cheers,
Ralf
For random:
- I thought the new random implementation with pluggable generators etc. was supposed to solve this problem already. Why doesn't it?
- The biggest issue with MKL monkeypatching random is that it breaks stream stability. How does the uarray approach address this?

For linalg:
- linalg already supports __array_ufunc__ for overrides. Why do we need a second override system? Isn't that redundant?
-n
-- Nathaniel J. Smith -- https://vorpus.org _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Sat, Sep 7, 2019 at 5:08 PM Ralf Gommers
On Sat, Sep 7, 2019 at 4:16 PM Nathaniel Smith wrote:
On Fri, Sep 6, 2019 at 11:04 PM Ralf Gommers wrote:
Vendoring means "include the code". So no dependency on an external package. If we don't vendor, it's going to be either unused, or end up as a dependency for the whole SciPy/PyData stack.
If we vendor it then it also ends up as a dependency for the whole SciPy/PyData stack...
It seems you're just using an unusual definition here. Dependency == a package you have to install, is present in pyproject.toml/install_requires, shows up in https://github.com/numpy/numpy/network/dependencies, etc.
That's a pretty trivial definition though. Surely the complexity of the installed code and its maintainer structure is what matters, not the exact details of how the install happens.
Actually, now that we've discussed the fft issue, I'd suggest to change the NEP to: vendor, and make default for fft, random, and linalg.
There's no way we can have an effective discussion of duck arrays, fft backends, random backends, and linalg backends all at once in a single thread.
Can you write separate NEPs for each of these? Some questions I'd like to see addressed:
For fft:
- fft is an entirely self-contained operation, with no interactions with the rest of the system; the only difference between implementations is speed. What problems are caused by monkeypatching, and how is uarray materially different from monkeypatching?
It was already explained in this thread, it's been on our roadmap for ~2 years at least, and monkeypatching is pretty much universally understood to be bad. If that's not enough, please search the NumPy issues for "monkeypatching". You'll find issues like https://github.com/numpy/numpy/issues/12374#issuecomment-438725645. At the moment this is very confusing, and hard to diagnose - you have to install a whole new NumPy and then find that the problem is gone (or not). Being able to switch backends in one line of code and re-test would be very valuable.
Sure, it's not meant as a trick question; I'm just saying you should write down the reasons and how you solve them in one place. Maybe some of the reasons monkeypatching is bad don't apply here, or maybe some of them do apply, but uarray doesn't solve them – we can't tell without doing the work. The link you gave doesn't involve monkeypatching or np.fft, so I'm not sure how it's relevant...?
It seems perhaps more useful to have a call so we can communicate with higher bandwidth, rather than lots of writing new NEPs here? In preparation, we need to write up in more detail how __array_function__ and unumpy fit together, rather than treat different pieces all separately (because the problems and pros/cons really won't change much between functions and submodules). I'll defer answering your other questions till that's done, so the discussion is hopefully a bit more structured.
I don't have a lot of time for calls, and you'd still have to write it up for everyone who isn't on the call... -n -- Nathaniel J. Smith -- https://vorpus.org
There may be another very concrete one (that's not yet in the NEP): allowing other libraries that consume ndarrays to use overrides. An example is numpy.fft: currently both mkl_fft and pyfftw monkeypatch NumPy, something we don't like all that much (in particular for mkl_fft, because it's the default in Anaconda). `__array_function__` isn't able to help here, because it will always choose NumPy's own implementation for ndarray input. With unumpy you can support multiple libraries that consume ndarrays.
unumpy doesn't help with this either though, does it? unumpy is double-opt-in: the code using np.fft has to switch to using unumpy.fft instead, and then someone has to enable the backend. But MKL/pyfftw started out as opt-in – you could `import mkl_fft` or `import pyfftw` – and the whole reason they switched to monkeypatching is that they decided that opt-in wasn't good enough for them.
Because numpy functions are used to write many library functions, the end user isn't always able to opt in by changing imports. So, for library functions, monkeypatching is not simply convenient but actually necessary. Take for example scipy.signal.fftconvolve: SciPy can't change to pyfftw for licensing reasons, so with SciPy < 1.4 your only option is to monkeypatch scipy.fftpack and numpy.fft. However, in SciPy >= 1.4, thanks to the uarray-based backend support in scipy.fft, I can write

```
from scipy import fft, signal
import pyfftw.interfaces.scipy_fft as pyfftw_fft

x = np.random.randn(1024, 1024)
with fft.set_backend(pyfftw_fft):
    y = signal.fftconvolve(x, x)  # Calls pyfftw's rfft, irfft
```

Yes, we had to opt in in the library function (signal moved from scipy.fftpack to scipy.fft). But because there can be distance between the set_backend call and the FFT calls, the library is now much more configurable. Generally speaking, any library written to use unumpy would be configurable: (i) by the user, (ii) at runtime, (iii) without changing library code and (iv) without monkeypatching.

In scipy.fft I actually did it slightly differently than unumpy: the scipy.fft interface itself has the uarray dispatch, and I set SciPy's version of pocketfft as the default global backend. This means that normal users don't need to set a backend, and thus don't need to opt in in any way. For NumPy to follow this pattern as well would require more change to NumPy's code base than the current NEP's suggestion, mainly in separating the interface from the implementation that would become the default backend.

- Peter
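The scipy.fft pattern Peter describes — a registered global default backend plus context-local overrides, so that ordinary users never have to opt in — can be sketched in pure Python. This is an illustrative toy, not the actual scipy.fft/uarray machinery; the names `register_default`, `current_backend`, and the backend classes are invented for this sketch:

```python
import threading
from contextlib import contextmanager

_local = threading.local()        # per-thread override stack
_default_backend = None           # global fallback

def register_default(backend):
    """Set the backend used when no context-local override is active."""
    global _default_backend
    _default_backend = backend

@contextmanager
def set_backend(backend):
    """Push a context-local override for the duration of the with-block."""
    stack = getattr(_local, "stack", [])
    _local.stack = stack + [backend]
    try:
        yield
    finally:
        _local.stack = stack

def current_backend():
    stack = getattr(_local, "stack", [])
    return stack[-1] if stack else _default_backend

def rfft(x):
    # The public interface only dispatches; implementations live in backends.
    return current_backend().rfft(x)

class PocketFFT:          # stands in for SciPy's bundled default backend
    @staticmethod
    def rfft(x):
        return ("pocketfft", x)

class PyFFTWBackend:      # stands in for pyfftw.interfaces.scipy_fft
    @staticmethod
    def rfft(x):
        return ("pyfftw", x)

register_default(PocketFFT)
print(rfft([1.0]))                 # default backend, no opt-in needed
with set_backend(PyFFTWBackend):
    print(rfft([1.0]))             # context-local override
```

The key design point is the last paragraph of Peter's message: the interface (`rfft` here) is separated from the implementation, and the bundled implementation is just one registered backend that happens to be the default.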
On Fri, 2019-09-06 at 14:45 -0700, Ralf Gommers wrote:
<snip>
That's part of it. The concrete problems it's solving are threefold: Array creation functions can be overridden. Array coercion is now covered. "Default implementations" will allow you to re-write your NumPy array more easily, when such efficient implementations exist in terms of other NumPy functions. That will also help achieve similar semantics, but as I said, they're just "default"...
There may be another very concrete one (that's not yet in the NEP): allowing other libraries that consume ndarrays to use overrides. An example is numpy.fft: currently both mkl_fft and pyfftw monkeypatch NumPy, something we don't like all that much (in particular for mkl_fft, because it's the default in Anaconda). `__array_function__` isn't able to help here, because it will always choose NumPy's own implementation for ndarray input. With unumpy you can support multiple libraries that consume ndarrays.
Another example is einsum: if you want to use opt_einsum for all inputs (including ndarrays), then you cannot use np.einsum. And yet another is using bottleneck ( https://kwgoodman.github.io/bottleneck-doc/reference.html) for nan- functions and partition. There's likely more of these.
The point is: sometimes the array protocols are preferred (e.g. Dask/Xarray-style meta-arrays), sometimes unumpy-style dispatch works better. It's also not necessarily an either or, they can be complementary.
Let me try to move the discussion from the github issue here (this may not be the best place). (https://github.com/numpy/numpy/issues/14441 which asked for easier creation functions together with `__array_function__`.)

I think an important note mentioned here is how users interact with unumpy, vs. __array_function__. The former is an explicit opt-in, while the latter is an implicit choice based on an `array-like` abstract base class and functional type based dispatching.

To quote NEP 18 on this: "The downsides are that this would require an explicit opt-in from all existing code, e.g., import numpy.api as np, and in the long term would result in the maintenance of two separate NumPy APIs. Also, many functions from numpy itself are already overloaded (but inadequately), so confusion about high vs. low level APIs in NumPy would still persist." (I do think this is a point we should not just ignore; `uarray` is a thin layer, but it has a big surface area.)

Now there are things where explicit opt-in is obvious. And the FFT example is one of those: there is no way to implicitly choose another backend (except by just replacing it, i.e. monkeypatching) [1]. And right now I think these are _very_ different.

Now for the end-users, choosing one array-like over another seems nicer as an implicit mechanism (why should I not mix sparse, dask and numpy arrays!?). This is the promise `__array_function__` tries to make. Unless convinced otherwise, my guess is that most library authors would strive for implicit support (i.e. sklearn, skimage, scipy).

Circling back to creation and coercion. In a purely Object type system, these would be classmethods, I guess, but in NumPy and the libraries above, we are lost.

Solution 1: Create explicit opt-in, e.g. through uarray. (NEP-31)
* Requires end-user opt-in.
* Seems cleaner in many ways.
* Requires a full copy of the API.

Solution 2: Add some coercion "protocol" (NEP-30) and expose a way to create new arrays more conveniently. This would practically mean adding an `array_type=np.ndarray` argument.
* _Not_ used by end-users! End users should use dask.linspace!
* Adds a "strange" API somewhere in numpy, and possibly a new "protocol" (additionally to coercion). [2]

I still feel these solve different issues. The second one is intended to make array-likes work implicitly in libraries (without end users having to do anything). While the first seems to force the end user to opt in, sometimes unnecessarily:

```
def my_library_func(array_like):
    exp = np.exp(array_like)
    idx = np.arange(len(exp))
    return idx, exp
```

This would have all the information for implicit opt-in/array-like support, but cannot do it right now. This is what I have been wondering: whether uarray/unumpy can in some way help me make this work (even _without_ the end user opting in). The reason is simply that, right now, I am very clear on the need for this use case, but not sure about the need for end user opt-in, since end users can just use dask.arange().

Cheers,

Sebastian

[1] To be honest, I do think a lot of the "issues" around monkeypatching exist just as much with backend choosing; the main differences seem to me that:
1. monkeypatching was not done explicitly (import mkl_fft; mkl_fft.monkeypatch_numpy())?
2. A backend system allows libraries to prefer one locally? (which I think is a big advantage)

[2] There are the options of adding `linspace_like` functions somewhere in a numpy submodule, or adding `linspace(..., array_type=np.ndarray)`, or simply inventing a new "protocol" (which is not really a protocol?), and making it `ndarray.__numpy_like_creation_functions__.arange()`.
Actually, after writing this I just realized something. With 1.17.x we have:
``` In [1]: import dask.array as da
In [2]: d = da.from_array(np.linspace(0, 1))
In [3]: np.fft.fft(d)
Out[3]: dask.array
``` In Anaconda `np.fft.fft` *is* `mkl_fft._numpy_fft.fft`, so this won't work. We have no bug report yet because 1.17.x hasn't landed in conda defaults yet (perhaps this is a/the reason why?), but it will be a problem.
The import numpy.overridable part is meant to help garner adoption, and to prefer the unumpy module if it is available (which will continue to be developed separately). That way it isn't so tightly coupled to the release cycle. One alternative Sebastian Berg mentioned (and I am on board with) is just moving unumpy into the NumPy organisation. What we fear keeping it separate is that the simple act of a pip install unumpy will keep people from using it or trying it out.
Note that this is not the most critical aspect. I pushed for vendoring as numpy.overridable because I want to not derail the comparison with NEP 30 et al. with a "should we add a dependency" discussion. The interesting part to decide on first is: do we need the unumpy override mechanism? Vendoring opt-in vs. making it default vs. adding a dependency is of secondary interest right now.
Cheers, Ralf
On Sat, Sep 7, 2019 at 1:07 PM Sebastian Berg
On Fri, 2019-09-06 at 14:45 -0700, Ralf Gommers wrote:
<snip>
That's part of it. The concrete problems it's solving are threefold: Array creation functions can be overridden. Array coercion is now covered. "Default implementations" will allow you to re-write your NumPy array more easily, when such efficient implementations exist in terms of other NumPy functions. That will also help achieve similar semantics, but as I said, they're just "default"...
There may be another very concrete one (that's not yet in the NEP): allowing other libraries that consume ndarrays to use overrides. An example is numpy.fft: currently both mkl_fft and pyfftw monkeypatch NumPy, something we don't like all that much (in particular for mkl_fft, because it's the default in Anaconda). `__array_function__` isn't able to help here, because it will always choose NumPy's own implementation for ndarray input. With unumpy you can support multiple libraries that consume ndarrays.
Another example is einsum: if you want to use opt_einsum for all inputs (including ndarrays), then you cannot use np.einsum. And yet another is using bottleneck ( https://kwgoodman.github.io/bottleneck-doc/reference.html) for nan- functions and partition. There's likely more of these.
The point is: sometimes the array protocols are preferred (e.g. Dask/Xarray-style meta-arrays), sometimes unumpy-style dispatch works better. It's also not necessarily an either or, they can be complementary.
Let me try to move the discussion from the github issue here (this may not be the best place). (https://github.com/numpy/numpy/issues/14441 which asked for easier creation functions together with `__array_function__`).
I think an important note mentioned here is how users interact with unumpy, vs. __array_function__. The former is an explicit opt-in, while the latter is implicit choice based on an `array-like` abstract base class and functional type based dispatching.
To quote NEP 18 on this: "The downsides are that this would require an explicit opt-in from all existing code, e.g., import numpy.api as np, and in the long term would result in the maintenance of two separate NumPy APIs. Also, many functions from numpy itself are already overloaded (but inadequately), so confusion about high vs. low level APIs in NumPy would still persist." (I do think this is a point we should not just ignore, `uarray` is a thin layer, but it has a big surface area)
Now there are things where explicit opt-in is obvious. And the FFT example is one of those, there is no way to implicitly choose another backend (except by just replacing it, i.e. monkeypatching) [1]. And right now I think these are _very_ different.
Now for the end-users choosing one array-like over another, seems nicer as an implicit mechanism (why should I not mix sparse, dask and numpy arrays!?). This is the promise `__array_function__` tries to make. Unless convinced otherwise, my guess is that most library authors would strive for implicit support (i.e. sklearn, skimage, scipy).
Circling back to creation and coercion. In a purely Object type system, these would be classmethods, I guess, but in NumPy and the libraries above, we are lost.
Solution 1: Create explicit opt-in, e.g. through uarray. (NEP-31) * Required end-user opt-in.
* Seems cleaner in many ways
* Requires a full copy of the API.
Bullets 1 and 3 are not required. If we decide to make it default, then there's no separate namespace.
Solution 2: Add some coercion "protocol" (NEP-30) and expose a way to create new arrays more conveniently. This would practically mean adding an `array_type=np.ndarray` argument. * _Not_ used by end-users! End users should use dask.linspace! * Adds "strange" API somewhere in numpy, and possible a new "protocol" (additionally to coercion).[2]
I still feel these solve different issues. The second one is intended to make array likes work implicitly in libraries (without end users having to do anything). While the first seems to force the end user to opt in, sometimes unnecessarily:
```
def my_library_func(array_like):
    exp = np.exp(array_like)
    idx = np.arange(len(exp))
    return idx, exp
```
Would have all the information for implicit opt-in/Array-like support, but cannot do it right now.
Can you explain this a bit more? `len(exp)` is a number, so `np.arange(number)` doesn't really have any information here.
This is what I have been wondering, if uarray/unumpy, can in some way help me make this work (even _without_ the end user opting in).
good question. if that needs to work in the absence of the user doing anything, it should be something like

```
with unumpy.determine_backend(exp):
    unumpy.arange(len(exp))  # or np.arange if we make unumpy default
```

to get the equivalent of `np.arange_like(len(exp), array_type=exp)`. Note that the `determine_backend` thing doesn't exist today.

The reason is simply that, right now, I am very clear on the need for this use case, but not sure about the need for end user opt-in, since end users can just use dask.arange().
I don't get the last part. The arange is inside a library function, so a user can't just go in and change things there. Cheers, Ralf
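The hypothetical `determine_backend` discussed above — pick the backend from the type of an existing array, so that creation calls inside a library follow that array's library — could be sketched roughly like this. Everything here (the registry, `determine_backend`, the toy array types) is invented for illustration; as noted in the thread, no such function exists in unumpy today:

```python
from contextlib import contextmanager

# Map array types to the backend that should handle creation functions.
_registry = {}
_active = [None]   # stack of active backends; None means plain fallback

def register(array_type, backend):
    _registry[array_type] = backend

@contextmanager
def determine_backend(array):
    """Activate the backend registered for type(array)."""
    _active.append(_registry[type(array)])
    try:
        yield
    finally:
        _active.pop()

def arange(n):
    # An overridable creation function: dispatches on the active backend.
    backend = _active[-1]
    if backend is None:
        return list(range(n))          # plain fallback, stands in for NumPy
    return backend.arange(n)

class ToyDaskArray:                    # stands in for dask.array.Array
    pass

class ToyDaskBackend:                  # stands in for a dask backend
    @staticmethod
    def arange(n):
        return ("dask.arange", n)

register(ToyDaskArray, ToyDaskBackend)

exp = ToyDaskArray()
with determine_backend(exp):
    print(arange(3))   # creation follows the type of `exp`
```

This is exactly the shape of Sebastian's `my_library_func` use case: the library wraps its body in `determine_backend(array_like)` once, and every creation call inside then matches the input's array library without the end user opting in.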
Cheers,
Sebastian
[1] To be honest, I do think a lot of the "issues" around monkeypatching exist just as much with backend choosing; the main differences seem to me that:
1. monkeypatching was not done explicitly (import mkl_fft; mkl_fft.monkeypatch_numpy())?
2. A backend system allows libraries to prefer one locally? (which I think is a big advantage)
[2] There are the options of adding `linspace_like` functions somewhere in a numpy submodule, or adding `linspace(..., array_type=np.ndarray)`, or simply inventing a new "protocol" (which is not really a protocol?), and making it `ndarray.__numpy_like_creation_functions__.arange()`.
Actually, after writing this I just realized something. With 1.17.x we have:
``` In [1]: import dask.array as da
In [2]: d = da.from_array(np.linspace(0, 1))
In [3]: np.fft.fft(d)
Out[3]: dask.array
``` In Anaconda `np.fft.fft` *is* `mkl_fft._numpy_fft.fft`, so this won't work. We have no bug report yet because 1.17.x hasn't landed in conda defaults yet (perhaps this is a/the reason why?), but it will be a problem.
The import numpy.overridable part is meant to help garner adoption, and to prefer the unumpy module if it is available (which will continue to be developed separately). That way it isn't so tightly coupled to the release cycle. One alternative Sebastian Berg mentioned (and I am on board with) is just moving unumpy into the NumPy organisation. What we fear keeping it separate is that the simple act of a pip install unumpy will keep people from using it or trying it out.
Note that this is not the most critical aspect. I pushed for vendoring as numpy.overridable because I want to not derail the comparison with NEP 30 et al. with a "should we add a dependency" discussion. The interesting part to decide on first is: do we need the unumpy override mechanism? Vendoring opt-in vs. making it default vs. adding a dependency is of secondary interest right now.
Cheers, Ralf
On 2019-09-07 15:33, Ralf Gommers wrote:
On Sat, Sep 7, 2019 at 1:07 PM Sebastian Berg wrote:
On Fri, 2019-09-06 at 14:45 -0700, Ralf Gommers wrote:
<snip>
That's part of it. The concrete problems it's solving are threefold: Array creation functions can be overridden. Array coercion is now covered. "Default implementations" will allow you to re-write your NumPy array more easily, when such efficient implementations exist in terms of other NumPy functions. That will also help achieve similar semantics, but as I said, they're just "default"...
There may be another very concrete one (that's not yet in the NEP): allowing other libraries that consume ndarrays to use overrides. An example is numpy.fft: currently both mkl_fft and pyfftw monkeypatch NumPy, something we don't like all that much (in particular for mkl_fft, because it's the default in Anaconda). `__array_function__` isn't able to help here, because it will always choose NumPy's own implementation for ndarray input. With unumpy you can support multiple libraries that consume ndarrays.
Another example is einsum: if you want to use opt_einsum for all inputs (including ndarrays), then you cannot use np.einsum. And yet another is using bottleneck ( https://kwgoodman.github.io/bottleneck-doc/reference.html) for nan- functions and partition. There's likely more of these.
The point is: sometimes the array protocols are preferred (e.g. Dask/Xarray-style meta-arrays), sometimes unumpy-style dispatch works better. It's also not necessarily an either or, they can be complementary.
Let me try to move the discussion from the github issue here (this may not be the best place). (https://github.com/numpy/numpy/issues/14441 which asked for easier creation functions together with `__array_function__`).
I think an important note mentioned here is how users interact with unumpy vs. __array_function__. The former is an explicit opt-in, while the latter is an implicit choice based on an `array-like` abstract base class and functional, type-based dispatching.
To quote NEP 18 on this: "The downsides are that this would require an explicit opt-in from all existing code, e.g., import numpy.api as np, and in the long term would result in the maintenance of two separate NumPy APIs. Also, many functions from numpy itself are already overloaded (but inadequately), so confusion about high vs. low level APIs in NumPy would still persist." (I do think this is a point we should not just ignore, `uarray` is a thin layer, but it has a big surface area)
Now there are things where explicit opt-in is obvious, and the FFT example is one of those: there is no way to implicitly choose another backend (except by just replacing it, i.e. monkeypatching) [1]. And right now I think these are _very_ different.
Now, for end-users, choosing one array-like over another seems nicer as an implicit mechanism (why should I not mix sparse, dask and numpy arrays!?). This is the promise `__array_function__` tries to make. Unless convinced otherwise, my guess is that most library authors would strive for implicit support (i.e. sklearn, skimage, scipy).
Circling back to creation and coercion. In a purely Object type system, these would be classmethods, I guess, but in NumPy and the libraries above, we are lost.
Solution 1: Create explicit opt-in, e.g. through uarray. (NEP-31)
* Requires end-user opt-in.
* Seems cleaner in many ways.
* Requires a full copy of the API.
Bullets 1 and 3 are not required: if we decide to make it default, then there's no separate namespace.
It does require explicit opt-in to have any benefits to the user.
Solution 2: Add some coercion "protocol" (NEP-30) and expose a way to create new arrays more conveniently. In practice this would mean adding an `array_type=np.ndarray` argument.
* _Not_ used by end-users! End users should use dask.linspace!
* Adds a "strange" API somewhere in numpy, and possibly a new "protocol" (in addition to coercion).[2]
I still feel these solve different issues. The second one is intended to make array likes work implicitly in libraries (without end users having to do anything). While the first seems to force the end user to opt in, sometimes unnecessarily:
```
def my_library_func(array_like):
    exp = np.exp(array_like)
    idx = np.arange(len(exp))
    return idx, exp
```
Would have all the information for implicit opt-in/Array-like support, but cannot do it right now.
Can you explain this a bit more? `len(exp)` is a number, so `np.arange(number)` doesn't really have any information here.
Right, but as a library author, I want a way to make it use the same type as `array_like` in this particular function; that is the point! The end-user already signaled they prefer, say, dask, due to the array that was actually passed in. (But this is just repeating what is below, I think.)
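To illustrate the gap concretely, here is a minimal (hypothetical) duck array that only implements NumPy's `__array_ufunc__` protocol: `np.exp` dispatches on the input type, but the creation function `np.arange` has nothing to dispatch on and always returns a plain ndarray:

```python
import numpy as np

class MyDuck:
    """Minimal stand-in for a duck array library (illustrative only)."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __len__(self):
        return len(self.data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Generic: unwrap MyDuck inputs, apply the ufunc, re-wrap.
        unwrapped = [x.data if isinstance(x, MyDuck) else x for x in inputs]
        return MyDuck(getattr(ufunc, method)(*unwrapped, **kwargs))

def my_library_func(array_like):
    exp = np.exp(array_like)     # dispatches via __array_ufunc__
    idx = np.arange(len(exp))    # len() is a plain int: no dispatch possible
    return idx, exp

idx, exp = my_library_func(MyDuck([0.0, 1.0]))
print(type(exp).__name__)  # MyDuck -- the override worked
print(type(idx).__name__)  # ndarray -- the creation function escaped it
```

Even though the end-user signaled their preferred array type by passing in a duck array, the library has no protocol through which to propagate that preference into `np.arange`.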
This is what I have been wondering, if uarray/unumpy, can in some way help me make this work (even _without_ the end user opting in).
Good question. If that needs to work in the absence of the user doing anything, it should be something like:

```
with unumpy.determine_backend(exp):
    unumpy.arange(len(exp))  # or np.arange if we make unumpy default
```
to get the equivalent to `np.arange_like(len(exp), array_type=exp)`.
Note, that `determine_backend` thing doesn't exist today.
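Since `determine_backend` doesn't exist, here is one entirely hypothetical pure-Python sketch of what such a helper could look like: map a concrete array type to a backend, and make that backend active for a scope. All names here are invented for illustration, not real unumpy API:

```python
import contextvars
from contextlib import contextmanager

_registry = {}                                       # array type -> backend name
_active = contextvars.ContextVar("active", default="numpy")

def register_backend(array_type, backend_name):
    _registry[array_type] = backend_name

@contextmanager
def determine_backend(arr):
    # Infer the backend from the array's type and activate it for a scope.
    token = _active.set(_registry.get(type(arr), "numpy"))
    try:
        yield
    finally:
        _active.reset(token)

def arange(n):
    # Toy stand-in: a real implementation would call the backend's arange.
    return (_active.get(), list(range(n)))

class DaskLikeArray:      # invented stand-in for e.g. dask.array.Array
    pass

register_backend(DaskLikeArray, "dask")

with determine_backend(DaskLikeArray()):
    print(arange(3))      # ('dask', [0, 1, 2])
print(arange(3))          # ('numpy', [0, 1, 2])
```

This is roughly the `np.arange_like(len(exp), array_type=exp)` behaviour mentioned above, expressed as a scoped context instead of an extra argument.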
Exactly, that is what I have been wondering about; there may be more issues around that. If it existed, we may be able to solve the implicit library usage by making libraries use unumpy (or similar), although at that point we half replace `__array_function__`, maybe. However, the main point is that without such functionality, NEP 30 and NEP 31 seem to solve slightly different issues with respect to how they interact with the end-user (opt in)? We may decide that we do not want to solve the library users' issue of wanting to support implicit opt-in for array-like inputs because it is a rabbit hole. But we may need to discuss/argue a bit more that it really is a deep enough rabbit hole that it is not worth the trouble.
The reason is simply that, right now, I am very clear on the need for this use case, but not sure about the need for end-user opt-in, since end users can just use dask.arange().
I don't get the last part. The arange is inside a library function, so a user can't just go in and change things there.
A "user" here means "end user". An end user writes a script, and they can easily change `arr = np.linspace(10)` to `arr = dask.linspace(10)`, or more likely just use one within one script and the other within another script, while both use the same sklearn functions. (Although using a backend switching may be nicer in some contexts) A library provider (library user of unumpy/numpy) of course cannot just use dask conveniently, unless they write their own `guess_numpy_like_module()` function first.
Cheers,
Ralf
Cheers,
Sebastian
[1] To be honest, I do think a lot of the "issues" around monkeypatching exist just as much with backend choosing; the main difference seems to me that: 1. monkeypatching was not done explicitly (import mkl_fft; mkl_fft.monkeypatch_numpy())? 2. A backend system allows libraries to prefer one locally (which I think is a big advantage).
[2] There are the options of adding `linspace_like` functions somewhere in a numpy submodule, or adding `linspace(..., array_type=np.ndarray)`, or simply inventing a new "protocol" (which is not really a protocol?), and making it `ndarray.__numpy_like_creation_functions__.arange()`.
Actually, after writing this I just realized something. With 1.17.x we have:
```
In [1]: import dask.array as da

In [2]: d = da.from_array(np.linspace(0, 1))

In [3]: np.fft.fft(d)
Out[3]: dask.array
```

In Anaconda `np.fft.fft` *is* `mkl_fft._numpy_fft.fft`, so this won't work. We have no bug report yet because 1.17.x hasn't landed in conda defaults yet (perhaps this is a/the reason why?), but it will be a problem.

The import numpy.overridable part is meant to help garner adoption, and to prefer the unumpy module if it is available (which will continue to be developed separately). That way it isn't so tightly coupled to the release cycle. One alternative Sebastian Berg mentioned (and I am on board with) is just moving unumpy into the NumPy organisation. What we fear in keeping it separate is that the simple act of a pip install unumpy will keep people from using it or trying it out.
On Sat, Sep 7, 2019 at 2:18 PM sebastian
<snip>
Right, but as a library author, I want a way a way to make it use the same type as `array_like` in this particular function, that is the point! The end-user already signaled they prefer say dask, due to the array that was actually passed in. (but this is just repeating what is below I think).
Okay, you meant conceptually:)
<snip>
Exactly, that is what I have been wondering about, there may be more issues around that. If it existed, we may be able to solve the implicit library usage by making libraries use unumpy (or similar). Although, at that point we half replace `__array_function__` maybe.
I don't really think so. Libraries can/will still use __array_function__ for most functionality, and just add a `with determine_backend` for the places where __array_function__ doesn't work.
However, the main point is that without such a functionality, NEP 30 and NEP 31 seem to solve slightly different issues with respect to how they interact with the end-user (opt in)?
Yes, I agree with that. Cheers, Ralf
<snip>
On 07.09.19 22:06, Sebastian Berg wrote:
On Fri, 2019-09-06 at 14:45 -0700, Ralf Gommers wrote:
<snip>
<snip>
Now for the end-users choosing one array-like over another, seems nicer as an implicit mechanism (why should I not mix sparse, dask and numpy arrays!?). This is the promise `__array_function__` tries to make. Unless convinced otherwise, my guess is that most library authors would strive for implicit support (i.e. sklearn, skimage, scipy).

You can, once you register the backend it becomes implicit, so all backends are tried until one succeeds. Unless you explicitly say "I do not want another backend" (only/coerce=True).
<snip>
```
def my_library_func(array_like):
    exp = np.exp(array_like)
    idx = np.arange(len(exp))
    return idx, exp
```
Would have all the information for implicit opt-in/Array-like support, but cannot do it right now. This is what I have been wondering, if uarray/unumpy, can in some way help me make this work (even _without_ the end user opting in). The reason is that simply, right now I am very clear on the need for this use case, but not sure about the need for end user opt in, since end users can just use dask.arange().
Sure, the end user can, but library authors cannot. And end users may want to easily port code to GPU or between back-ends, just as library authors might.
[2] There are the options of adding `linspace_like` functions somewhere in a numpy submodule, or adding `linspace(..., array_type=np.ndarray)`, or simply inventing a new "protocl" (which is not really a protocol?), and make it `ndarray.__numpy_like_creation_functions__.arange()`.
Handling things like RandomState can get complicated here. <snip>
On Tue, 2019-09-10 at 17:28 +0200, Hameer Abbasi wrote:
On 07.09.19 22:06, Sebastian Berg wrote:
On Fri, 2019-09-06 at 14:45 -0700, Ralf Gommers wrote:
<snip>
<snip>

You can, once you register the backend it becomes implicit, so all backends are tried until one succeeds. Unless you explicitly say "I do not want another backend" (only/coerce=True).
The thing here being "once you register the backend". Thus requiring at least in some form an explicit opt-in by the end user. Also, unless you use the with statement (with all the scoping rules attached), you cannot plug the coercion/creation hole left by `__array_function__`.
<snip>
Sure, the end user can, but library authors cannot. And end users may want to easily port code to GPU or between back-ends, just as library authors might.
Yes, but library authors want to solve the particular thing above right now, and I am still not sure how uarray helps there. If it does, then only with added complexity _and_ (at least currently) explicit end-user opt-in.

Now, I am not a particularly good judge of these things, but I have been trying to figure out how things can improve with it, and still I am tempted to say that uarray is a giant step in no particular direction at all. Of course it _can_ solve everything, but right now it seems like it might require a py2 -> py3 like transition. And even then it is so powerful that it probably comes with its own bunch of issues (such as far-away side effects due to the scoping of with statements).

Best,
Sebastian
On Tue, Sep 10, 2019 at 10:53 AM Sebastian Berg
On Tue, 2019-09-10 at 17:28 +0200, Hameer Abbasi wrote:
On 07.09.19 22:06, Sebastian Berg wrote:
<snip>
The thing here being "once you register the backend". Thus requiring at least in some form an explicit opt-in by the end user. Also, unless you use the with statement (with all the scoping rules attached), you cannot plug the coercion/creation hole left by `__array_function__`.
The need for this is clear, I think. We're discussing on the unumpy repo whether this can be done with a minor change to how unumpy works, or by having backends auto-register somehow on import. It should be possible without mandating that the end user explicitly do something, but it needs some thought. Stay tuned.

Cheers,
Ralf
On Fri, Sep 6, 2019 at 12:53 AM Nathaniel Smith
On Mon, Sep 2, 2019 at 11:21 PM Ralf Gommers
wrote: On Mon, Sep 2, 2019 at 2:09 PM Nathaniel Smith
wrote: On Tue, Sep 3, 2019 at 2:04 AM Hameer Abbasi
wrote: The fact that we're having to design more and more protocols for a lot of very similar things is, to me, an indicator that we do have holistic problems that ought to be solved by a single protocol.
But the reason we've had trouble designing these protocols is that they're each different :-). If it was just a matter of copying __array_ufunc__ we'd have been done in a few minutes...
I don't think that argument is correct. That we now have two very similar protocols is simply a matter of history and limited developer time. NEP 18 discusses in several places that __array_ufunc__ should be brought in line with __array_function__, and that we can migrate a function from one protocol to the other. There's no technical reason other than backwards compat and dev time why we couldn't use __array_function__ for ufuncs also.

Cheers,
Ralf
On Fri, Sep 6, 2019 at 11:53 AM Ralf Gommers
On Fri, Sep 6, 2019 at 12:53 AM Nathaniel Smith
wrote: On Tue, Sep 3, 2019 at 2:04 AM Hameer Abbasi
wrote: The fact that we're having to design more and more protocols for a lot of very similar things is, to me, an indicator that we do have holistic problems that ought to be solved by a single protocol.
But the reason we've had trouble designing these protocols is that they're each different :-). If it was just a matter of copying __array_ufunc__ we'd have been done in a few minutes...
I don't think that argument is correct. That we now have two very similar protocols is simply a matter of history and limited developer time. NEP 18 discusses in several places that __array_ufunc__ should be brought in line with __array_function__, and that we can migrate a function from one protocol to the other. There's no technical reason other than backwards compat and dev time why we couldn't use __array_function__ for ufuncs also.
Huh, that's interesting! Apparently we have a profoundly different understanding of what we're doing here. To me, __array_ufunc__ and __array_function__ are completely different. In fact I'd say __array_ufunc__ is a good idea and __array_function__ is a bad idea, and would definitely not be in favor of combining them together.

The key difference is that __array_ufunc__ allows for *generic* implementations. Most duck array libraries can write a single implementation of __array_ufunc__ that works for *all* ufuncs, even new third-party ufuncs that the duck array library has never heard of, because ufuncs all share the same structure of a loop wrapped around a core operation, and they can treat the core operation as a black box. For example:

- Dask can split up the operation across its tiled sub-arrays, and then for each tile it invokes the core operation.
- xarray can do its label-based axis matching, and then invoke the core operation.
- bcolz can loop over the array uncompressing one block at a time, invoking the core operation on each.
- sparse arrays can check the ufunc .identity attribute to find out whether 0 is an identity, and if so invoke the operation directly on the non-zero entries; otherwise, it can loop over the array and densify it in blocks and invoke the core operation on each. (It would be useful to have a bit more metadata on the ufunc, so e.g. np.subtract could declare that zero is a right-identity but not a left-identity, but that's a simple enough extension to make at some point.)

Result: __array_ufunc__ makes it totally possible to take a ufunc from scipy.special or a random new one created with numba, and have it immediately work on an xarray wrapped around dask wrapped around bcolz, out-of-the-box. That's a clean, generic interface. [1]

OTOH, __array_function__ doesn't allow this kind of simplification: if we were using __array_function__ for ufuncs, every library would have to special-case every individual ufunc, which leads to dramatically more work and more potential for bugs.

To me, the whole point of interfaces is to reduce coupling. When you have N interacting modules, it's unmaintainable if every change requires considering every N! combination. From this perspective, __array_function__ isn't good, but it is still somewhat constrained: the result of each operation is still determined by the objects involved, nothing else. In this regard, uarray is even more extreme than __array_function__, because arbitrary operations can be arbitrarily changed by arbitrarily distant code. It sort of feels like the argument for uarray is: well, designing maintainable interfaces is a lot of work, so forget it, let's just make it easy to monkeypatch everything and call it a day.

That said, in my replies in this thread I've been trying to stay productive and focus on narrower concrete issues. I'm pretty sure that __array_function__ and uarray will turn out to be bad ideas and will fail, but that's not a proven fact, it's just an informed guess. And the road that I favor also has lots of risks and uncertainty. So I don't have a problem with trying both as experiments and learning more! But hopefully that explains why it's not at all obvious that uarray solves the protocol design problems we've been talking about.

-n

[1] There are also some cases that __array_ufunc__ doesn't handle as nicely. One obvious one is that GPU/TPU libraries still need to special-case individual ufuncs. But that's not a limitation of __array_ufunc__, it's a limitation of GPUs – they can't run CPU code, so they can't use the CPU implementation of the core operations. Another limitation is that __array_ufunc__ is weak at handling operations that involve mixed libraries (e.g. np.add(bcolz_array, sparse_array)) – to work well, this might require that bcolz have special-case handling for sparse arrays, or vice-versa, so you still potentially have some N**2 special cases, though at least here N is the number of duck array libraries, not the number of ufuncs. I think this is an interesting target for future work. But in general, __array_ufunc__ goes a long way to taming the complexity of interacting libraries and ufuncs.

-- Nathaniel J. Smith -- https://vorpus.org
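[Editor's note: the generic-implementation point above can be made concrete with a minimal sketch. The class name and wrapper behavior here are hypothetical, not taken from any of the libraries mentioned: one __array_ufunc__ that works for every ufunc, and every ufunc method, without naming any of them.]

```python
import numpy as np

class WrappedArray:
    """Hypothetical duck array: a thin wrapper over an ndarray."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Generic handler: unwrap the inputs, invoke the ufunc method
        # (__call__, reduce, accumulate, ...) as a black box, re-wrap.
        unwrapped = [x.data if isinstance(x, WrappedArray) else x
                     for x in inputs]
        return WrappedArray(getattr(ufunc, method)(*unwrapped, **kwargs))

a = WrappedArray([1.0, 2.0, 3.0])
out = np.add(a, 1)           # any ufunc works, even ones defined later
red = np.multiply.reduce(a)  # ufunc methods dispatch here too
```

Any third-party ufunc (e.g. one from scipy.special or one created with numba) would hit the same handler, which is exactly the property __array_function__ does not give you.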
On 08.09.19 09:53, Nathaniel Smith wrote:
On Fri, Sep 6, 2019 at 11:53 AM Ralf Gommers
wrote: On Fri, Sep 6, 2019 at 12:53 AM Nathaniel Smith
wrote: The fact that we're having to design more and more protocols for a lot of very similar things is, to me, an indicator that we do have holistic problems that ought to be solved by a single protocol. But the reason we've had trouble designing these protocols is that
On Tue, Sep 3, 2019 at 2:04 AM Hameer Abbasi
wrote: they're each different. If it was just a matter of copying __array_ufunc__ we'd have been done in a few minutes... I don't think that argument is correct. That we now have two very similar protocols is simply a matter of history and limited developer time. NEP 18 discusses in several places that __array_function__ should be brought in line with __array_ufunc__, and that we can migrate a function from one protocol to the other. There's no technical reason other than backwards compat and dev time why we couldn't use __array_function__ for ufuncs also. Huh, that's interesting! Apparently we have a profoundly different understanding of what we're doing here. To me, __array_ufunc__ and __array_function__ are completely different. In fact I'd say __array_ufunc__ is a good idea and __array_function__ is a bad idea, and would definitely not be in favor of combining them together. The key difference is that __array_ufunc__ allows for *generic* implementations. Most duck array libraries can write a single implementation of __array_ufunc__ that works for *all* ufuncs, even new third-party ufuncs that the duck array library has never heard of, because ufuncs all share the same structure of a loop wrapped around a core operation, and they can treat the core operation as a black box. For example:
- Dask can split up the operation across its tiled sub-arrays, and then for each tile it invokes the core operation.
- xarray can do its label-based axis matching, and then invoke the core operation.
- bcolz can loop over the array uncompressing one block at a time, invoking the core operation on each.
- sparse arrays can check the ufunc .identity attribute to find out whether 0 is an identity, and if so invoke the operation directly on the non-zero entries; otherwise, it can loop over the array and densify it in blocks and invoke the core operation on each. (It would be useful to have a bit more metadata on the ufunc, so e.g. np.subtract could declare that zero is a right-identity but not a left-identity, but that's a simple enough extension to make at some point.)
Result: __array_ufunc__ makes it totally possible to take a ufunc from scipy.special or a random new one created with numba, and have it immediately work on an xarray wrapped around dask wrapped around bcolz, out-of-the-box. That's a clean, generic interface. [1]
OTOH, __array_function__ doesn't allow this kind of simplification: if we were using __array_function__ for ufuncs, every library would have to special-case every individual ufunc, which leads to dramatically more work and more potential for bugs.
But uarray does allow this kind of simplification. You would do the following inside a uarray backend:

    def __ua_function__(func, args, kwargs):
        with ua.skip_backend(self_backend):
            # Do code here; dispatches to everything but this backend
            ...

This is possible today and is done in the dask backend inside unumpy, for example.
-- Nathaniel J. Smith -- https://vorpus.org _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Sun, Sep 8, 2019 at 1:04 AM Hameer Abbasi
On 08.09.19 09:53, Nathaniel Smith wrote:
OTOH, __array_function__ doesn't allow this kind of simplification: if we were using __array_function__ for ufuncs, every library would have to special-case every individual ufunc, which leads to dramatically more work and more potential for bugs.
But uarray does allow this kind of simplification. You would do the following inside a uarray backend:
    def __ua_function__(func, args, kwargs):
        with ua.skip_backend(self_backend):
            # Do code here; dispatches to everything but this backend
            ...
You can dispatch to the underlying operation, sure, but you can't implement a generic ufunc loop because you don't know that 'func' is actually a bound ufunc method, or have any way to access the underlying ufunc object. (E.g. consider the case where 'func' is 'np.add.reduce'.) The critical part of my example was that it's a new ufunc that none of these libraries have ever heard of before.

Ufuncs have a lot of consistent structure beyond what generic Python callables have, and the whole point of __array_ufunc__ is that implementors can rely on that structure. You get to work at a higher level of abstraction.

A similar but simpler example would be the protocol we've sketched out for concatenation: the idea would be to capture the core similarity between np.concatenate/np.hstack/np.vstack/np.dstack/np.column_stack/np.row_stack/any other variants, so that implementors only have to worry about the higher-level concept of "concatenation" rather than the raw APIs of all those individual functions.

-n

-- Nathaniel J. Smith -- https://vorpus.org
On 08.09.19 10:56, Nathaniel Smith wrote:
On Sun, Sep 8, 2019 at 1:04 AM Hameer Abbasi
wrote: On 08.09.19 09:53, Nathaniel Smith wrote:
Ufuncs have a lot of consistent structure beyond what generic Python callables have, and the whole point of __array_ufunc__ is that implementors can rely on that structure. You get to work at a higher level of abstraction.
A similar but simpler example would be the protocol we've sketched out for concatenation: the idea would be to capture the core similarity between np.concatenate/np.hstack/np.vstack/np.dstack/np.column_stack/np.row_stack/any other variants, so that implementors only have to worry about the higher-level concept of "concatenation" rather than the raw APIs of all those individual functions.
There's a solution for that too: Default implementations. Implement concatenate, and you've got a default implementation for all of those you mentioned. Similarly for transpose/swapaxis/moveaxis and family.
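[Editor's note: the default-implementation idea can be illustrated in plain NumPy terms. The function names below are hypothetical, and this is ordinary Python rather than uarray/unumpy code: a backend implements one concatenate primitive, and the stacking variants are derived from it.]

```python
import numpy as np

def backend_concatenate(arrays, axis=0):
    # The single primitive a backend would implement; we delegate to
    # NumPy here purely for illustration.
    return np.concatenate(arrays, axis=axis)

# "Default implementations": each stacking variant reduces to the one
# concatenate primitive, so backends get all of them for free.
def default_vstack(arrays):
    return backend_concatenate([np.atleast_2d(a) for a in arrays], axis=0)

def default_hstack(arrays):
    arrays = [np.atleast_1d(a) for a in arrays]
    axis = 0 if all(a.ndim == 1 for a in arrays) else 1
    return backend_concatenate(arrays, axis=axis)

v = default_vstack([np.array([1, 2]), np.array([3, 4])])
h = default_hstack([np.array([1, 2]), np.array([3, 4])])
```

The same pattern would cover transpose/swapaxes/moveaxis: implement one primitive, derive the rest.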
-n
On 08.09.19 10:56, Nathaniel Smith wrote:
On Sun, Sep 8, 2019 at 1:04 AM Hameer Abbasi
wrote: On 08.09.19 09:53, Nathaniel Smith wrote:
OTOH, __array_function__ doesn't allow this kind of simplification: if we were using __array_function__ for ufuncs, every library would have to special-case every individual ufunc, which leads to dramatically more work and more potential for bugs. But uarray does allow this kind of simplification. You would do the following inside a uarray backend:

    def __ua_function__(func, args, kwargs):
        with ua.skip_backend(self_backend):
            # Do code here; dispatches to everything but this backend
            ...

You can dispatch to the underlying operation, sure, but you can't implement a generic ufunc loop because you don't know that 'func' is actually a bound ufunc method, or have any way to access the underlying ufunc object. (E.g. consider the case where 'func' is 'np.add.reduce'.) The critical part of my example was that it's a new ufunc that none of these libraries have ever heard of before.

You don't get np.add.reduce, you get np.ufunc.reduce with self=np.add. So you can access the underlying ufunc and the method, nothing limiting about that.

Ufuncs have a lot of consistent structure beyond what generic Python callables have, and the whole point of __array_ufunc__ is that implementors can rely on that structure. You get to work at a higher level of abstraction.

A similar but simpler example would be the protocol we've sketched out for concatenation: the idea would be to capture the core similarity between np.concatenate/np.hstack/np.vstack/np.dstack/np.column_stack/np.row_stack/any other variants, so that implementors only have to worry about the higher-level concept of "concatenation" rather than the raw APIs of all those individual functions.
-n
On Sun, Sep 8, 2019 at 12:54 AM Nathaniel Smith
On Fri, Sep 6, 2019 at 12:53 AM Nathaniel Smith
wrote: On Tue, Sep 3, 2019 at 2:04 AM Hameer Abbasi
wrote: The fact that we're having to design more and more protocols for a lot of very similar things is, to me, an indicator that we do have holistic problems that ought to be solved by a single protocol.
But the reason we've had trouble designing these protocols is that they're each different :-). If it was just a matter of copying __array_ufunc__ we'd have been done in a few minutes...
I don't think that argument is correct. That we now have two very similar protocols is simply a matter of history and limited developer time. NEP 18 discusses in several places that __array_function__ should be brought in line with __array_ufunc__, and that we can migrate a function from one
On Fri, Sep 6, 2019 at 11:53 AM Ralf Gommers
wrote: protocol to the other. There's no technical reason other than backwards compat and dev time why we couldn't use __array_function__ for ufuncs also. Huh, that's interesting! Apparently we have a profoundly different understanding of what we're doing here.
That is interesting indeed. We should figure this out first - no point discussing a NEP about plugging the gaps in our override system when we don't have a common understanding of why we wanted/needed an override system in the first place. To me, __array_ufunc__ and
__array_function__ are completely different. In fact I'd say __array_ufunc__ is a good idea and __array_function__ is a bad idea,
It's early days, but "customer feedback" certainly has been more enthusiastic for __array_function__. Also from what I've seen so far it works well. Example: at the SciPy sprints someone put together Xarray plus pydata/sparse to use distributed sparse arrays for visualizing some large genetic (I think) data sets. That was made to work in a single day, with impressively little code. and would definitely not be in favor of combining them together.
I'm not saying we should. But __array_ufunc__ is basically a slight specialization - knowing that the function that was called is a ufunc can be handy but is usually irrelevant.
The key difference is that __array_ufunc__ allows for *generic* implementations.
Implementations of what? Most duck array libraries can write a single
implementation of __array_ufunc__ that works for *all* ufuncs, even new third-party ufuncs that the duck array library has never heard of,
I see where you're going with this. You are thinking of reusing the ufunc implementation to do a computation. That's a minor use case (imho), and I can't remember seeing it used. The original use case was scipy.sparse matrices. The executive summary of NEP 13 talks about this. It's about calling `np.some_ufunc(other_ndarray_like)` and "handing over control" to that object rather than the numpy function starting to execute. Also note that NEP 13 states in the summary "This covers some of the same ground as Travis Oliphant’s proposal to retro-fit NumPy with multi-methods" (reminds one of uarray....). For scipy.sparse, the layout of the data doesn't make sense to numpy. All that was desired was that the sparse matrix needs to know what function was called, so it can call its own implementation of that function instead. because ufuncs all share the same structure of a loop wrapped around a
core operation, and they can treat the core operation as a black box. For example:
- Dask can split up the operation across its tiled sub-arrays, and then for each tile it invokes the core operation.
Works for __array_function__ too. Note, *not* by explicitly reusing the numpy function. Dask anyway has its own functions that mirror the numpy API. Dask's __array_function__ just does the forwarding to its own functions. Also, a Dask array could be a collection of CuPy arrays, and CuPy implements __array_ufunc__. So explicitly reusing the NumPy ufunc implementation on whatever comes in would be, well, not so nice.

- xarray can do its label-based axis matching, and then invoke the core operation.

Could do this with __array_function__ too.

- bcolz can loop over the array uncompressing one block at a time, invoking the core operation on each.

Not sure about this one.

- sparse arrays can check the ufunc .identity attribute to find out whether 0 is an identity, and if so invoke the operation directly on the non-zero entries; otherwise, it can loop over the array and densify it in blocks and invoke the core operation on each. (It would be useful to have a bit more metadata on the ufunc, so e.g. np.subtract could declare that zero is a right-identity but not a left-identity, but that's a simple enough extension to make at some point.)

This is a case where knowing if something is a ufunc helps use a property of it, so there the more specialized nature of __array_ufunc__ helps. Seems niche though, and could probably also be done by checking if a function is an instance of np.ufunc via __array_function__.
Result: __array_ufunc__ makes it totally possible to take a ufunc from scipy.special or a random new one created with numba, and have it immediately work on an xarray wrapped around dask wrapped around bcolz, out-of-the-box. That's a clean, generic interface. [1]
This last point, using third-party ufuncs, is the interesting one to me. They have to be generated with the NumPy ufunc machinery, so the dispatch mechanism is attached to them. We could do third party functions for __array_function__ too, but that would require making @array_function_dispatch public, which we haven't done (yet?).
OTOH, __array_function__ doesn't allow this kind of simplification: if we were using __array_function__ for ufuncs, every library would have to special-case every individual ufunc, which leads to dramatically more work and more potential for bugs.
This all assumes that "reusing the ufunc's implementation" is the one thing that matters. To me that's a small side benefit, which we haven't seen a whole lot of use of in the 2+ years that __array_ufunc__ was available. I think that what (for example) CuPy does - use __array_ufunc__ to simply take over control, is both the major use case and the original motivation for introducing the protocol.
To me, the whole point of interfaces is to reduce coupling. When you have N interacting modules, it's unmaintainable if every change requires considering every N! combination. From this perspective, __array_function__ isn't good, but it is still somewhat constrained: the result of each operation is still determined by the objects involved, nothing else. In this regard, uarray is even more extreme than __array_function__, because arbitrary operations can be arbitrarily changed by arbitrarily distant code. It sort of feels like the argument for uarray is: well, designing maintainable interfaces is a lot of work, so forget it, let's just make it easy to monkeypatch everything and call it a day.
That said, in my replies in this thread I've been trying to stay productive and focus on narrower concrete issues. I'm pretty sure that __array_function__ and uarray will turn out to be bad ideas and will fail, but that's not a proven fact, it's just an informed guess. And the road that I favor also has lots of risks and uncertainty.
But what is that road, and what do you think the goal is? To me it's: separate our API from our implementation. Yours seems to be "reuse our implementations" for __array_ufunc__, but I can't see how that generalizes beyond ufuncs. So I don't have a problem with trying both as experiments and learning
more! But hopefully that explains why it's not at all obvious that uarray solves the protocol design problems we've been talking about.
-n
[1] There are also some cases that __array_ufunc__ doesn't handle as nicely. One obvious one is that GPU/TPU libraries still need to special-case individual ufuncs. But that's not a limitation of __array_ufunc__, it's a limitation of GPUs
I think this is an important point. GPUs are massively popular, and will very likely just continue to grow in importance. So anything we do in this space that says "well it works, just not for GPUs" is probably not going to solve our most pressing problems. – they can't run CPU code,
so they can't use the CPU implementation of the core operations. Another limitation is that __array_ufunc__ is weak at handling operations that involve mixed libraries (e.g. np.add(bcolz_array, sparse_array)) – to work well, this might require that bcolz have special-case handling for sparse arrays, or vice-versa, so you still potentially have some N**2 special cases, though at least here N is the number of duck array libraries, not the number of ufuncs. I think this is an interesting target for future work. But in general, __array_ufunc__ goes a long way to taming the complexity of interacting libraries and ufuncs.
With *only* ufuncs you can't create that many interesting applications, you need the other functions too...

Cheers,
Ralf
On Sun, Sep 8, 2019 at 8:40 AM Ralf Gommers
On Sun, Sep 8, 2019 at 12:54 AM Nathaniel Smith
wrote: On Fri, Sep 6, 2019 at 11:53 AM Ralf Gommers
wrote: On Fri, Sep 6, 2019 at 12:53 AM Nathaniel Smith
wrote: On Tue, Sep 3, 2019 at 2:04 AM Hameer Abbasi
wrote: The fact that we're having to design more and more protocols for a lot of very similar things is, to me, an indicator that we do have holistic problems that ought to be solved by a single protocol.
But the reason we've had trouble designing these protocols is that they're each different :-). If it was just a matter of copying __array_ufunc__ we'd have been done in a few minutes...
I don't think that argument is correct. That we now have two very similar protocols is simply a matter of history and limited developer time. NEP 18 discusses in several places that __array_function__ should be brought in line with __array_ufunc__, and that we can migrate a function from one protocol to the other. There's no technical reason other than backwards compat and dev time why we couldn't use __array_function__ for ufuncs also.
Huh, that's interesting! Apparently we have a profoundly different understanding of what we're doing here.
That is interesting indeed. We should figure this out first - no point discussing a NEP about plugging the gaps in our override system when we don't have a common understanding of why we wanted/needed an override system in the first place.
To me, __array_ufunc__ and __array_function__ are completely different. In fact I'd say __array_ufunc__ is a good idea and __array_function__ is a bad idea,
It's early days, but "customer feedback" certainly has been more enthusiastic for __array_function__. Also from what I've seen so far it works well. Example: at the SciPy sprints someone put together Xarray plus pydata/sparse to use distributed sparse arrays for visualizing some large genetic (I think) data sets. That was made to work in a single day, with impressively little code.
Yeah, it's true, and __array_function__ made a bunch of stuff that used to be impossible become possible, I'm not saying it didn't. My prediction is that the longer we live with it, the more limits we'll hit and the more problems we'll have with long-term maintainability. I don't think initial enthusiasm is a good predictor of that either way.
The key difference is that __array_ufunc__ allows for *generic* implementations.
Implementations of what?
Generic in the sense that you can write __array_ufunc__ once and have it work for all ufuncs.
Most duck array libraries can write a single implementation of __array_ufunc__ that works for *all* ufuncs, even new third-party ufuncs that the duck array library has never heard of,
I see where you're going with this. You are thinking of reusing the ufunc implementation to do a computation. That's a minor use case (imho), and I can't remember seeing it used.
I mean, I just looked at dask and xarray, and they're both doing exactly what I said, right now in shipping code. What use cases are you targeting here if you consider dask and xarray out-of-scope? :-)
This is a case where knowing if something is a ufunc helps use a property of it, so there the more specialized nature of __array_ufunc__ helps. Seems niche though, and could probably also be done by checking if a function is an instance of np.ufunc via __array_function__.
Sparse arrays aren't very niche... and the isinstance trick is possible in some cases, but (a) it's relying on an undocumented implementation detail of __array_function__; according to __array_function__'s API contract, you could just as easily get passed the ufunc's __call__ method instead of the object itself, and (b) it doesn't work at all for ufunc methods like reduce, outer, accumulate. These are both show-stoppers IMO.
This last point, using third-party ufuncs, is the interesting one to me. They have to be generated with the NumPy ufunc machinery, so the dispatch mechanism is attached to them. We could do third party functions for __array_function__ too, but that would require making @array_function_dispatch public, which we haven't done (yet?).
With __array_function__ it's theoretically possible to do the dispatch on third-party functions, but when someone defines a new function they always have to go update all the duck array libraries to hard-code in some special knowledge of their new function. So in my example, even if we made @array_function_dispatch public, you still couldn't use your nice new numba-created gufunc unless you first convinced dask, xarray, and bcolz to all accept patches to support your new gufunc. With __array_ufunc__, it works out-of-the-box.
But what is that road, and what do you think the goal is? To me it's: separate our API from our implementation. Yours seems to be "reuse our implementations" for __array_ufunc__, but I can't see how that generalizes beyond ufuncs.
The road is to define *abstractions* for the operations we expose through our API, so that duck array implementors can work against a contract with well-defined preconditions and postconditions, so they can write code that works reliably even when the surrounding environment changes. That's the only way to keep things maintainable AFAICT. If the API contract is just a vague handwave at the numpy API, then no-one knows which details actually matter, it's impossible to test, implementations will inevitably end up with subtle long-standing bugs, and literally any change in numpy could potentially break duck array users, we don't know. So my motivation is that I like testing, I don't like bugs, and I like being able to maintain things with confidence :-). The principles are much more general than ufuncs; that's just a pertinent example.
I think this is an important point. GPUs are massively popular, and will very likely just continue to grow in importance. So anything we do in this space that says "well it works, just not for GPUs" is probably not going to solve our most pressing problems.
I'm not saying "__array_ufunc__ doesn't work for GPUs". I'm saying that when it comes to GPUs, there's an upper bound for how good you can hope to do, and __array_ufunc__ achieves that upper bound. So does __array_function__. So if we only care about GPUs, they're about equally good. But if we also care about dask and xarray and compressed storage and sparse storage and ... then __array_ufunc__ is strictly superior in those cases. So replacing __array_ufunc__ with __array_function__ would be a major backwards step. -n -- Nathaniel J. Smith -- https://vorpus.org
On Sun, Sep 8, 2019 at 7:27 PM Nathaniel Smith
One case that hasn’t been brought up in this thread is unit-handling. For example, unyt’s __array_ufunc__ implementation explicitly handles ufuncs and will bail if someone tries to use a ufunc that unyt doesn’t know about. I tried to implement a completely generic solution but ended up concluding I couldn’t do that without silently generating answers with incorrect units. I definitely agree with your analysis that this sort of implementation is error-prone, in fact we just had to do a bugfix release to fix clip suddenly not working now that it’s a ufunc in numpy 1.17.
-n
-- Nathaniel J. Smith -- https://vorpus.org _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Sun, Sep 8, 2019 at 6:27 PM Nathaniel Smith wrote:
On Sun, Sep 8, 2019 at 8:40 AM Ralf Gommers wrote:
On Sun, Sep 8, 2019 at 12:54 AM Nathaniel Smith wrote:
On Fri, Sep 6, 2019 at 11:53 AM Ralf Gommers wrote:
On Fri, Sep 6, 2019 at 12:53 AM Nathaniel Smith wrote:
On Tue, Sep 3, 2019 at 2:04 AM Hameer Abbasi <einstein.edison@gmail.com> wrote:
The fact that we're having to design more and more protocols for a lot of very similar things is, to me, an indicator that we do have holistic problems that ought to be solved by a single protocol.
But the reason we've had trouble designing these protocols is that they're each different :-). If it was just a matter of copying __array_ufunc__ we'd have been done in a few minutes...
I don't think that argument is correct. That we now have two very similar protocols is simply a matter of history and limited developer time. NEP 18 discusses in several places that __array_ufunc__ should be brought in line with __array_function__, and that we can migrate a function from one protocol to the other. There's no technical reason other than backwards compat and dev time why we couldn't use __array_function__ for ufuncs also.
Huh, that's interesting! Apparently we have a profoundly different understanding of what we're doing here.
That is interesting indeed. We should figure this out first - no point discussing a NEP about plugging the gaps in our override system when we don't have a common understanding of why we wanted/needed an override system in the first place.
To me, __array_ufunc__ and __array_function__ are completely different. In fact I'd say __array_ufunc__ is a good idea and __array_function__ is a bad idea.
It's early days, but "customer feedback" certainly has been more enthusiastic for __array_function__. Also from what I've seen so far it works well. Example: at the SciPy sprints someone put together Xarray plus pydata/sparse to use distributed sparse arrays for visualizing some large genetic (I think) data sets. That was made to work in a single day, with impressively little code.
Yeah, it's true, and __array_function__ made a bunch of stuff that used to be impossible become possible, I'm not saying it didn't. My prediction is that the longer we live with it, the more limits we'll hit and the more problems we'll have with long-term maintainability. I don't think initial enthusiasm is a good predictor of that either way.
The key difference is that __array_ufunc__ allows for *generic* implementations.
Implementations of what?
Generic in the sense that you can write __array_ufunc__ once and have it work for all ufuncs.
Most duck array libraries can write a single implementation of __array_ufunc__ that works for *all* ufuncs, even new third-party ufuncs that the duck array library has never heard of,
I see where you're going with this. You are thinking of reusing the ufunc implementation to do a computation. That's a minor use case (imho), and I can't remember seeing it used.
I mean, I just looked at dask and xarray, and they're both doing exactly what I said, right now in shipping code. What use cases are you targeting here if you consider dask and xarray out-of-scope? :-)
I don't think that's the interesting part, or even right. When you call `np.cos(dask_array_of_cupy_arrays)`, it certainly will not reuse the NumPy ufunc np.cos. It will call da.cos, and that will in turn call cupy.cos. Yes it will call np.cos if you feed it a dask array that contains a NumPy ndarray under the hood. But that's equally true of np.mean, which is not a ufunc. The story here is ~95% parallel for __array_ufunc__ and __array_function__. When I said I hadn't seen it used, I meant in ways that are fundamentally different between those two protocols.
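To make the "generic implementation" point above concrete, here is a minimal sketch (the `LoggedArray` class is hypothetical, invented purely for illustration) of a duck array whose single `__array_ufunc__` works for every ufunc, including ufunc methods like `reduce` and `accumulate`:

```python
import numpy as np

class LoggedArray:
    """Hypothetical duck array: one generic __array_ufunc__ that
    works for every ufunc, even third-party ones it has never seen."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap duck-array inputs down to plain ndarrays.
        unwrapped = [x.data if isinstance(x, LoggedArray) else x
                     for x in inputs]
        # Defer to the ufunc's own implementation; `method` may be
        # "__call__", "reduce", "accumulate", "outer", ...
        result = getattr(ufunc, method)(*unwrapped, **kwargs)
        return LoggedArray(result)

a = LoggedArray([1.0, 2.0, 3.0])
print(np.add(a, a).data)      # [2. 4. 6.]
print(np.add.reduce(a).data)  # 6.0
```

This is the sense in which __array_ufunc__ is "generic": the wrapper never needs to know which ufunc it is dispatching.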
This is a case where knowing if something is a ufunc helps use a property of it, so there the more specialized nature of __array_ufunc__ helps. Seems niche though, and could probably also be done by checking if a function is an instance of np.ufunc via __array_function__.
Sparse arrays aren't very niche... and the isinstance trick is possible in some cases, but (a) it's relying on an undocumented implementation detail of __array_function__; according to __array_function__'s API contract, you could just as easily get passed the ufunc's __call__ method instead of the object itself,
That seems to be a matter of making it documented? Currently the dispatcher is only attached to functions, not methods.
and (b) it doesn't work at all for ufunc methods like reduce, outer, accumulate. These are both show-stoppers IMO.
No idea without looking in more detail if this can be made to work, but a quick count in the SciPy code base says ~10 uses of .reduce, 2 of .outer and 0 of .accumulate. So hardly showstoppers I'd say.
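For reference, the isinstance trick under discussion, and its limitation for ufunc methods, can be checked directly in plain Python:

```python
import numpy as np

# The "isinstance trick": a duck array could, in principle,
# distinguish ufuncs from ordinary NumPy functions by type.
print(isinstance(np.add, np.ufunc))    # True
print(isinstance(np.mean, np.ufunc))   # False

# The point about ufunc methods: bound methods such as reduce,
# outer and accumulate are *not* ufunc instances, so a type check
# on the dispatched object cannot identify their parent ufunc.
print(isinstance(np.add.reduce, np.ufunc))      # False
print(isinstance(np.add.accumulate, np.ufunc))  # False
```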
This last point, using third-party ufuncs, is the interesting one to me. They have to be generated with the NumPy ufunc machinery, so the dispatch mechanism is attached to them. We could do third party functions for __array_function__ too, but that would require making @array_function_dispatch public, which we haven't done (yet?).
With __array_function__ it's theoretically possible to do the dispatch on third-party functions, but when someone defines a new function they always have to go update all the duck array libraries to hard-code in some special knowledge of their new function. So in my example, even if we made @array_function_dispatch public, you still couldn't use your nice new numba-created gufunc unless you first convinced dask, xarray, and bcolz to all accept patches to support your new gufunc. With __array_ufunc__, it works out-of-the-box.
Yep that's true. May still be better than not doing anything though, in some cases. You'll get a TypeError with a clear message for functions that aren't implemented, for something that otherwise likely doesn't work either.
But what is that road, and what do you think the goal is? To me it's: separate our API from our implementation. Yours seems to be "reuse our implementations" for __array_ufunc__, but I can't see how that generalizes beyond ufuncs.
The road is to define *abstractions* for the operations we expose through our API, so that duck array implementors can work against a contract with well-defined preconditions and postconditions, so they can write code that works reliably even when the surrounding environment changes. That's the only way to keep things maintainable AFAICT. If the API contract is just a vague handwave at the numpy API, then no-one knows which details actually matter, it's impossible to test, implementations will inevitably end up with subtle long-standing bugs, and literally any change in numpy could potentially break duck array users, we don't know. So my motivation is that I like testing, I don't like bugs, and I like being able to maintain things with confidence :-). The principles are much more general than ufuncs; that's just a pertinent example.
Well, it's hard to argue with that in the abstract. I like all those things too :-). The question is, what does that mean concretely? Most of the NumPy API, (g)ufuncs excepted, doesn't have well-defined abstractions, and it's hard to imagine we'll get those even if we could be more liberal with backwards compat. Most functions are just, well, functions. You can dispatch on them, or not. Your preference seems to be the latter, but I have a hard time figuring out how that translates into anything but "do nothing". Do you have a concrete alternative? I think we've chosen to try the former - dispatch on functions so we can reuse the NumPy API. It could work out well, it could give some long-term maintenance issues, time will tell. The question is now if and how to plug the gap that __array_function__ left. Its main limitation is "doesn't work for functions that don't have an array-like input" - that left out ~10-20% of functions. So now we have a proposal for a structural solution to that last 10-20%. It seems logical to want that gap plugged, rather than go back and say "we shouldn't have gone for the first 80%, so let's go no further".
I think this is an important point. GPUs are massively popular, and will very likely just continue to grow in importance. So anything we do in this space that says "well it works, just not for GPUs" is probably not going to solve our most pressing problems.
I'm not saying "__array_ufunc__ doesn't work for GPUs". I'm saying that when it comes to GPUs, there's an upper bound for how good you can hope to do, and __array_ufunc__ achieves that upper bound. So does __array_function__. So if we only care about GPUs, they're about equally good.
Indeed. But if we also care about dask and xarray and compressed storage and sparse storage and ... then __array_ufunc__ is strictly superior in those cases.
That it's superior is not really interesting though, is it? Their main characteristic (the actual override) is identical, and then ufuncs go a bit further. I think to convince me you're going to have to come up with an actual alternative plan to `__array_ufunc__ + __array_function__ + unumpy-or-alternative-to-it`. And re maintenance worries: I think cleaning up our API surface and namespaces will go *much* further than yes/no on overrides.
So replacing __array_ufunc__ with __array_function__ would be a major backwards step.
To be 100% clear, no one is actually proposing this. Cheers, Ralf
On Mon, Sep 9, 2019 at 6:27 PM Ralf Gommers
I think we've chosen to try the former - dispatch on functions so we can reuse the NumPy API. It could work out well, it could give some long-term maintenance issues, time will tell. The question is now if and how to plug the gap that __array_function__ left. Its main limitation is "doesn't work for functions that don't have an array-like input" - that left out ~10-20% of functions. So now we have a proposal for a structural solution to that last 10-20%. It seems logical to want that gap plugged, rather than go back and say "we shouldn't have gone for the first 80%, so let's go no further".
I'm excited about solving the remaining 10-20% of use cases for flexible array dispatching, but the unumpy interface suggested here (numpy.overridable) feels like a redundant redo of __array_function__ and __array_ufunc__.
I would much rather continue to develop specialized protocols for the remaining use cases. Summarizing those I've seen in this thread, these include:
1. Overrides for customizing array creation and coercion.
2. Overrides to implement operations for new dtypes.
3. Overriding implementations of NumPy functions, e.g., FFT and ufuncs with MKL.
(1) could mostly be solved by adding np.duckarray() and another function for duck array coercion. There is still the matter of overriding np.zeros and the like, which perhaps justifies another new protocol, but in my experience the use-cases for truly creating an array from scratch are quite rare.
(2) should be tackled as part of overhauling NumPy's dtype system to better support user-defined dtypes. But it should definitely be in the form of specialized protocols, e.g., which pass preallocated arrays into ufuncs for a new dtype. By design, new dtypes should not be able to customize the semantics of array *structure*.
(3) could potentially motivate a new solution, but it should exist *inside* of select existing NumPy implementations, after checking for overrides with __array_function__. If the only option NumPy provides for overriding np.fft is to implement np.overrideable.fft, I doubt that would suffice to dissuade MKL developers from monkey patching it -- they already decided that a separate namespace is not good enough for them.
I also share Nathaniel's concern that the overrides in unumpy are too powerful, by allowing for control from arbitrary function arguments and even *non-local* control (i.e., global variables) from context managers. This level of flexibility can make code very hard to debug, especially in larger codebases.
Best, Stephan
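The np.duckarray() mentioned in point (1) does not exist in NumPy; a rough sketch of the idea proposed in NEP 30 (the function and the `__duckarray__` protocol name are from that proposal, the `Duck` class is invented for illustration) might look like:

```python
import numpy as np

def duckarray(array_like):
    """Sketch of the proposed np.duckarray() (NEP 30, not in NumPy):
    return duck arrays unchanged, coerce everything else."""
    if hasattr(array_like, "__duckarray__"):
        # __duckarray__ is the protocol name proposed in NEP 30.
        return array_like.__duckarray__()
    return np.asarray(array_like)

class Duck:
    """Hypothetical duck array opting in to the protocol."""
    def __duckarray__(self):
        return self  # already array-like: no coercion needed

print(type(duckarray(Duck())).__name__)     # Duck
print(type(duckarray([1, 2, 3])).__name__)  # ndarray
```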
On Mon, 2019-09-09 at 20:32 -0700, Stephan Hoyer wrote:
(1) could mostly be solved by adding np.duckarray() and another function for duck array coercion. There is still the matter of overriding np.zeros and the like, which perhaps justifies another new protocol, but in my experience the use-cases for truly creating an array from scratch are quite rare.
There is an issue open about adding more functions for that. Made me wonder if giving a method of choosing the duck-array whose `__array_function__` is used could not solve it reasonably. Similar to explicitly choosing a specific template version to call in templated code. In other words `np.arange(100)` (but with a completely different syntax, probably hidden away only for libraries to use). Maybe it is indeed time to write up a list of options to plug that hole, and then see where it brings us.
Best,
Sebastian
In other words `np.arange(100)` (but with a completely different syntax, probably hidden away only for libraries to use).
It sounds a bit like you're describing factory classmethods there. Is the solution to this problem to move (leaving behind aliases) `np.arange` to `ndarray.arange`, `np.zeros` to `ndarray.zeros`, etc - callers then would use `type(duckarray).zeros` if they're trying to generalize.
Eric
On Mon, 2019-09-09 at 22:26 -0700, Eric Wieser wrote:
In other words `np.arange(100)` (but with a completely different syntax, probably hidden away only for libraries to use). It sounds a bit like you're describing factory classmethods there. Is the solution to this problem to move (leaving behind aliases) `np.arange` to `ndarray.arange`, `np.zeros` to `ndarray.zeros`, etc - callers then would use `type(duckarray).zeros` if they're trying to generalize.
Yeah, factory classmethod is probably the better way to describe it. The question is where you hide them away conveniently (and how to access them). And of course if/what completely different alternatives exist. In a sense, `__array_function__` is a bit like a collection of operator dunder methods, I guess. So, we need another collection for classmethods. And that was the quick, possibly silly, idea to also use `__array_function__`. So yeah, there is not much of a point in not simply creating another place for them, or even using individual dunder classmethods. But we still need an "operator"/function to access them nicely, unless we want to force `type(duckarray).…` on library authors. I guess the important thing is mostly what would be convenient to downstream implementers. - Sebastian
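The factory-classmethod idea being discussed might look roughly like this (the `MyDuckArray` class and its methods are hypothetical illustrations; today's np.ndarray has no such classmethods):

```python
import numpy as np

class MyDuckArray:
    """Hypothetical duck array with factory classmethods, so that
    creation functions live on the type rather than in the numpy
    namespace."""

    def __init__(self, data):
        self.data = np.asarray(data)

    @classmethod
    def zeros(cls, shape, dtype=float):
        return cls(np.zeros(shape, dtype=dtype))

    @classmethod
    def arange(cls, n):
        return cls(np.arange(n))

def zeros_matching(x, shape):
    # Generic library code: create a new array of the same duck type,
    # without ever naming that type explicitly.
    return type(x).zeros(shape)

x = MyDuckArray([1, 2, 3])
z = zeros_matching(x, (2, 2))
print(type(z).__name__, z.data.shape)  # MyDuckArray (2, 2)
```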
On 10.09.19 05:32, Stephan Hoyer wrote:
I'm excited about solving the remaining 10-20% of use cases for flexible array dispatching, but the unumpy interface suggested here (numpy.overridable) feels like a redundant redo of __array_function__ and __array_ufunc__.
I would much rather continue to develop specialized protocols for the remaining usecases. Summarizing those I've seen in this thread, these include: 1. Overrides for customizing array creation and coercion. 2. Overrides to implement operations for new dtypes. 3. Overriding implementations of NumPy functions, e.g., FFT and ufuncs with MKL.
(1) could mostly be solved by adding np.duckarray() and another function for duck array coercion. There is still the matter of overriding np.zeros and the like, which perhaps justifies another new protocol, but in my experience the use-cases for truly creating an array from scratch are quite rare.
While they're rare for libraries like XArray; CuPy, Dask and PyData/Sparse need these.
(2) should be tackled as part of overhauling NumPy's dtype system to better support user-defined dtypes. But it should definitely be in the form of specialized protocols, e.g., which pass preallocated arrays into ufuncs for a new dtype. By design, new dtypes should not be able to customize the semantics of array *structure*.
We already have a split in the type system with e.g. Cython's buffers, Numba's parallel type system. This is a different issue altogether, e.g. allowing a unyt dtype to spawn a unyt array, rather than forcing a re-write of unyt to cooperate with NumPy's new dtype system.
(3) could potentially motivate a new solution, but it should exist *inside* of select existing NumPy implementations, after checking for overrides with __array_function__. If the only option NumPy provides for overriding np.fft is to implement np.overrideable.fft, I doubt that would suffice to dissuade MKL developers from monkey patching it -- they already decided that a separate namespace is not good enough for them.
That has already been addressed by Ralf in another email. We're proposing to merge that into NumPy proper. Also, you're missing a few: 4. Having default implementations that allow overrides of a large part of the API while defining only a small part. This holds for e.g. transpose/concatenate. 5. Generation of Random numbers (overriding RandomState). CuPy has its own implementation which would be nice to override.
I also share Nathaniel's concern that the overrides in unumpy are too powerful, by allowing for control from arbitrary function arguments and even *non-local* control (i.e., global variables) from context managers. This level of flexibility can make code very hard to debug, especially in larger codebases.
Backend switching needs global context, in any case. There isn't a good way around that other than the class dundermethods outlined in another thread, which would require rewrites of large amounts of code.
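A minimal sketch of the kind of context-local backend switching being debated, using Python's contextvars (the `set_backend` name and the string backends are illustrative, not the actual unumpy API):

```python
import contextvars
from contextlib import contextmanager

_backend = contextvars.ContextVar("backend", default="numpy")

@contextmanager
def set_backend(name):
    # Context-local override: only code running inside this block
    # sees the new backend; the previous value is restored on exit.
    token = _backend.set(name)
    try:
        yield
    finally:
        _backend.reset(token)

def fft(signal):
    # A real multimethod would dispatch to the active backend's
    # implementation; here we just report which one would be used.
    return f"{_backend.get()}.fft({signal})"

print(fft("x"))                # numpy.fft(x)
with set_backend("mkl"):
    print(fft("x"))            # mkl.fft(x)
print(fft("x"))                # numpy.fft(x)
```

This is the "global context" trade-off in miniature: the call site of `fft` gives no hint that its behavior depends on surrounding state, which is exactly the debuggability concern raised above.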
On Tue, Sep 10, 2019 at 6:06 AM Hameer Abbasi
On 10.09.19 05:32, Stephan Hoyer wrote:
(2) should be tackled as part of overhauling NumPy's dtype system to better support user defined dtypes. But it should definitely be in the form of specialized protocols, e.g., which pass in preallocated arrays to into ufuncs for a new dtype. By design, new dtypes should not be able to customize the semantics of array *structure*.
We already have a split in the type system with e.g. Cython's buffers, Numba's parallel type system. This is a different issue altogether, e.g. allowing a unyt dtype to spawn a unyt array, rather than forcing a re-write of unyt to cooperate with NumPy's new dtype system.
I guess you're proposing that operations like np.sum(numpy_array, dtype=other_dtype) could rely on other_dtype for the implementation and potentially return a non-NumPy array? I'm not sure this is well motivated -- it would be helpful to discuss actual use-cases. The most commonly used NumPy functionality related to dtypes can be found only in methods on np.ndarray, e.g., astype() and view(). But I don't think there's any proposal to change that.
4. Having default implementations that allow overrides of a large part of the API while defining only a small part. This holds for e.g. transpose/concatenate.
I'm not sure how unumpy solves the problems we encountered when trying to do this with __array_function__ -- namely the way that it exposes all of NumPy's internals, or requires rewriting a lot of internal NumPy code to ensure it always casts inputs with asarray(). I think it would be useful to expose default implementations of NumPy operations somewhere to make it easier to implement __array_function__, but it doesn't make much sense to couple this to user-facing overrides. These can be exposed as a separate package or numpy module (e.g., numpy.default_implementations) that uses np.duckarray(), which library authors can make use of by calling inside their __array_function__ methods.
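The "default implementations" idea can be sketched as follows. `default_ptp` and `ThinArray` are invented names, and this only illustrates the shape of the approach (a generic implementation written in terms of other overridable functions, reused from inside __array_function__), not a proposed NumPy API:

```python
import numpy as np

# A hypothetical "default implementation" of np.ptp, expressed purely in
# terms of other overridable functions, so a duck array gets it for free.
def default_ptp(a, axis=None):
    return np.max(a, axis=axis) - np.min(a, axis=axis)

class ThinArray:
    # A minimal duck array that reuses the generic default rather than
    # reimplementing np.ptp itself.
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.ptp:
            return default_ptp(self.data, *args[1:], **kwargs)
        # Unwrap and defer to NumPy for everything else.
        args = tuple(x.data if isinstance(x, ThinArray) else x for x in args)
        return func(*args, **kwargs)

result = np.ptp(ThinArray([1, 5, 3]))  # dispatches through __array_function__
```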
5. Generation of random numbers (overriding RandomState). CuPy has its own implementation which would be nice to override.
I'm not sure that NumPy's random state objects make sense for duck arrays. Because these are stateful objects, they are pretty coupled to NumPy's implementation -- you cannot store any additional state on RandomState objects that might be needed for a new implementation. At a bare minimum, you will lose the reproducibility of random seeds, though this may be less of a concern with the new random API.
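A small illustration of the reproducibility guarantee at stake (this is standard NumPy behavior, nothing overridden): RandomState's seeded stream is part of NumPy's compatibility promise, so a backend that swaps in a different generator cannot in general preserve these exact values.

```python
import numpy as np

# The same seed must yield the same stream -- a guarantee tied to
# RandomState's specific internal implementation.
a = np.random.RandomState(42).standard_normal(3)
b = np.random.RandomState(42).standard_normal(3)
same = np.array_equal(a, b)  # identical seeds, identical streams
```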
I also share Nathaniel's concern that the overrides in unumpy are too powerful, by allowing for control from arbitrary function arguments and even *non-local* control (i.e., global variables) from context managers. This level of flexibility can make code very hard to debug, especially in larger codebases.
Backend switching needs global context, in any case. There isn't a good way around that other than the class dundermethods outlined in another thread, which would require rewrites of large amounts of code.
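For readers unfamiliar with uarray, the context-local flavor of backend switching can be emulated in a few lines of pure Python. This is an illustrative toy, not uarray's actual implementation, and all names below are invented:

```python
import contextlib
import contextvars

# A context-local "active backend" variable, defaulting to NumPy.
_backend = contextvars.ContextVar("backend", default="numpy")

@contextlib.contextmanager
def set_backend(name):
    # Install a backend for the duration of the with-block, then restore.
    token = _backend.set(name)
    try:
        yield
    finally:
        _backend.reset(token)

def fft(x):
    # A multiplexed entry point: behavior depends on the active backend.
    return "{}::fft".format(_backend.get())

default = fft([1.0, 2.0])        # uses the "numpy" default
with set_backend("mkl_fft"):
    switched = fft([1.0, 2.0])   # uses the context-local override
restored = fft([1.0, 2.0])       # default is restored on exit
```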
Do we really need to support robust backend switching in NumPy? I'm not strongly opposed, but what use cases does it actually solve to be able to override np.fft.fft rather than using a new function? At some point, if you want maximum performance you won't be writing the code using NumPy proper anyways. At best you'll be using a system with duck-array support like CuPy.
On Tue, Sep 10, 2019 at 10:59 AM Stephan Hoyer
Do we really need to support robust backend switching in NumPy? I'm not strongly opposed, but what use cases does it actually solve to be able to override np.fft.fft rather than using a new function?
I don't know, but that feels like an odd question. We wanted an FFT backend system. Now applying __array_function__ to numpy.fft happened without a real discussion, but as a backend system I don't think it would have met the criteria. Something that works for CuPy, Dask and Xarray, but not for Pyfftw or mkl_fft is only half a solution. Cheers, Ralf
On Wed, Sep 11, 2019 at 4:18 PM Ralf Gommers
On Tue, Sep 10, 2019 at 10:59 AM Stephan Hoyer
wrote: On Tue, Sep 10, 2019 at 6:06 AM Hameer Abbasi
wrote: On 10.09.19 05:32, Stephan Hoyer wrote:
On Mon, Sep 9, 2019 at 6:27 PM Ralf Gommers
wrote: I think we've chosen to try the former - dispatch on functions so we can reuse the NumPy API. It could work out well, it could give some long-term maintenance issues, time will tell. The question is now if and how to plug the gap that __array_function__ left. Its main limitation is "doesn't work for functions that don't have an array-like input" - that left out ~10-20% of functions. So now we have a proposal for a structural solution to that last 10-20%. It seems logical to want that gap plugged, rather than go back and say "we shouldn't have gone for the first 80%, so let's go no further".
I'm excited about solving the remaining 10-20% of use cases for flexible array dispatching,
Great! I think most (but not all) of us are on the same page here. Actually now that Peter came up with the `like=` keyword idea for array creation functions I'm very interested in seeing that worked out, feels like that could be a nice solution for part of that 10-20% that did look pretty bad before.
but the unumpy interface suggested here (numpy.overridable) feels like a
redundant redo of __array_function__ and __array_ufunc__.
A bit of context: a big part of the reason I advocated for numpy.overridable is that library authors can use it *only* for the parts not already covered by the protocols we already have. If there's overlap, there are several ways to deal with that, including exposing only part of the unumpy API surface. It does plug all the holes in one go (although you can then indeed argue it does too much), and there is no other coherent proposal/vision yet that does this.

What you wrote below comes closest, and I'd love to see that worked out (e.g. the like= argument for array creation). What I don't like is an ad-hoc plugging of one hole at a time without visibility on how many more protocols and new workaround functions in the API we would need. So hopefully we can come to an apples-to-apples comparison of two design alternatives.
Also, we just discussed this whole thread in the community call, and it's clear that it's a complex matter with many different angles. It's very hard to get a full overview. Our conclusion in the call was that this will benefit from an in-person discussion. The sprint in November may be a really good opportunity for that.
Sounds good, I'm looking forward to the discussion at the November sprint!
In the meantime we can of course keep working out ideas/docs. For now I think it's clear that we (the NEP authors) have some homework to do - that may take some time.
I don't know, but that feels like an odd question. We wanted an FFT backend system. Now applying __array_function__ to numpy.fft happened without a real discussion, but as a backend system I don't think it would have met the criteria. Something that works for CuPy, Dask and Xarray, but not for Pyfftw or mkl_fft is only half a solution.
I agree, __array_function__ is not a backend system.
Cheers, Ralf
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On 09.09.19 03:26, Nathaniel Smith wrote:
[snip] Generic in the sense that you can write __array_ufunc__ once and have it work for all ufuncs. You can do that too with __ua_function__: you get np.ufunc.__call__, with self=<any-ufunc>. The same holds for, say, RandomState objects, once implemented.
Most duck array libraries can write a single implementation of __array_ufunc__ that works for *all* ufuncs, even new third-party ufuncs that the duck array library has never heard of,
I see where you're going with this. You are thinking of reusing the ufunc implementation to do a computation. That's a minor use case (imho), and I can't remember seeing it used.

I mean, I just looked at dask and xarray, and they're both doing exactly what I said, right now in shipping code. What use cases are you targeting here if you consider dask and xarray out-of-scope? :-)
This is a case where knowing if something is a ufunc helps use a property of it, so there the more specialized nature of __array_ufunc__ helps. Seems niche though, and could probably also be done by checking if a function is an instance of np.ufunc via __array_function__.

Sparse arrays aren't very niche... and the isinstance trick is possible in some cases, but (a) it's relying on an undocumented implementation detail of __array_function__; according to __array_function__'s API contract, you could just as easily get passed the ufunc's __call__ method instead of the object itself, and (b) it doesn't work at all for ufunc methods like reduce, outer, accumulate. These are both show-stoppers IMO.
It does work for all ufunc methods. You just get passed in the appropriate method (ufunc.reduce, ufunc.accumulate, ...), with self=<any-ufunc>.
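A sketch of what that claim looks like from the override's side: when the dispatched callable is a bound ufunc method (np.add.reduce, np.multiply.accumulate, ...), the ufunc itself is recoverable via `__self__`, so one generic handler can cover every ufunc and every method. `ua_function_override` below is a plain illustrative function standing in for the real `__ua_function__` protocol:

```python
import numpy as np

def ua_function_override(method, args, kwargs):
    # The bound method carries its ufunc: np.add.reduce.__self__ is np.add.
    ufunc = getattr(method, "__self__", None)
    if isinstance(ufunc, np.ufunc):
        return "handled {}.{}".format(ufunc.__name__, method.__name__)
    return NotImplemented

out = ua_function_override(np.add.reduce, ([1, 2, 3],), {})
```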
[snip]
Hi Nathaniel, On 02.09.19 23:09, Nathaniel Smith wrote:
On Mon, Sep 2, 2019 at 2:15 AM Hameer Abbasi
wrote: Me, Ralf Gommers and Peter Bell (both cc’d) have come up with a proposal on how to solve the array creation and duck array problems. The solution is outlined in NEP-31, currently in the form of a PR, [1] Thanks for putting this together! It'd be great to have more engagement between uarray and numpy.
============================================================
NEP 31 — Context-local and global overrides of the NumPy API
============================================================ Now that I've read this over, my main feedback is that right now it seems too vague and high-level to give it a fair evaluation? The idea of a NEP is to lay out a problem and proposed solution in enough detail that it can be evaluated and critiqued, but this felt to me more like it was pointing at some other documents for all the details and then promising that uarray has solutions for all our problems.
This NEP takes a more holistic approach: It assumes that there are parts of the API that need to be overridable, and that these will grow over time. It provides a general framework and a mechanism to avoid a design of a new protocol each time this is required.

The idea of a holistic approach makes me nervous, because I'm not sure we have holistic problems.
The fact that we're having to design more and more protocols for a lot of very similar things is, to me, an indicator that we do have holistic problems that ought to be solved by a single protocol.
Sometimes a holistic approach is the right thing; other times it means sweeping the actual problems under the rug, so things *look* simple and clean but in fact nothing has been solved, and they just end up biting us later. And from the NEP as currently written, I can't tell whether this is the good kind of holistic or the bad kind of holistic.
Now I'm writing vague handwavey things, so let me follow my own advice and make it more concrete with an example :-).
When Stephan and I were writing NEP 22, the single thing we spent the most time discussing was the problem of duck-array coercion, and in particular what to do about existing code that does np.asarray(duck_array_obj).
The reason this is challenging is that there's a lot of code written in Cython/C/C++ that calls np.asarray, and then blindly casts the return value to a PyArray struct and starts accessing the raw memory fields. If np.asarray starts returning anything besides a real-actual np.ndarray object, then this code will start corrupting random memory, leading to a segfault at best.
Stephan felt strongly that this meant that existing np.asarray calls *must not* ever return anything besides an np.ndarray object, and therefore we needed to add a new function np.asduckarray(), or maybe an explicit opt-in flag like np.asarray(..., allow_duck_array=True).
I agreed that this was a problem, but thought we might be able to get away with an "opt-out" system, where we add an allow_duck_array= flag, but make it *default* to True, and document that the Cython/C/C++ users who want to work with a raw np.ndarray object should modify their code to explicitly call np.asarray(obj, allow_duck_array=False). This would mean that for a while people who tried to pass duck-arrays into legacy library would get segfaults, but there would be a clear path for fixing these issues as they were discovered.
Either way, there are also some other details to figure out: how does this affect the C version of asarray? What about np.asfortranarray – probably that should default to allow_duck_array=False, even if we did make np.asarray default to allow_duck_array=True, right?
Now if I understand right, your proposal would be to make it so any code in any package could arbitrarily change the behavior of np.asarray for all inputs, e.g. I could just decide that np.asarray([1, 2, 3]) should return some arbitrary non-np.ndarray object. It seems like this has a much greater potential for breaking existing Cython/C/C++ code, and the NEP doesn't currently describe why this extra power is useful, and it doesn't currently describe how it plans to mitigate the downsides. (For example, if a caller needs a real np.ndarray, then is there some way to explicitly request one? The NEP doesn't say.) Maybe this is all fine and there are solutions to these issues, but any proposal to address duck array coercion needs to at least talk about these issues!

I believe I addressed this in a previous email, but the NEP doesn't suggest overriding numpy.asarray or numpy.array. It suggests overriding numpy.overridable.asarray and numpy.overridable.array, so existing code will continue to work as-is and overrides are opt-in rather than forced on you.
The argument about this kind of code could be applied to return values from other functions as well. That said, there is a way to request a NumPy array object explicitly:

    with ua.set_backend(np):
        x = np.asarray(...)
And that's just one example... array coercion is a particularly central and tricky problem, but the numpy API big, and there are probably other problems like this. For another example, I don't understand what the NEP is proposing to do about dtypes at all.
Just as there are other kinds of arrays, there may be other kinds of dtypes that are not NumPy dtypes. They cannot be attached to a NumPy array object (as Sebastian pointed out to me in last week's Community meeting), but they can still provide other powerful features.
That's why I think the NEP needs to be fleshed out a lot more before it will be possible to evaluate fairly.
-n
I just pushed a new version of the NEP to my PR, the full-text of which
is below.
============================================================
NEP 31 — Context-local and global overrides of the NumPy API
============================================================
:Author: Hameer Abbasi
Hello everyone,
Thanks to all the feedback from the community, in particular Sebastian
Berg, we have a new draft of NEP-31.
Please find the full text quoted below for discussion and reference. Any
feedback and discussion is welcome.
============================================================
NEP 31 — Context-local and global overrides of the NumPy API
============================================================
:Author: Hameer Abbasi
Thanks to all the feedback, we have a new PR of NEP-31.
Please find the full-text quoted below:
============================================================
NEP 31 — Context-local and global overrides of the NumPy API
============================================================
:Author: Hameer Abbasi
On Wed, Oct 9, 2019 at 6:00 PM Juan Nunez-Iglesias
Hi all, and thank you for all your hard work with this.
I wanted to provide more of an "end user" perspective than I think has been present in this discussion so far. Over the past month, I've quickly skimmed some emails on this thread and skipped others altogether. I am far from a NumPy novice, but essentially *all* of the discussion went over my head. For a while my attitude was "Oh well, far smarter people than me are dealing with this, I'll let them figure it out." Looking at the participants in the thread, I worry that this is the attitude almost everyone has taken, and that the solution proposed will not be easy enough to deal with for any meaningful adoption. Certainly with `__array_function__` I only took interest when our tests broke with 1.17rc1.
Today I was particularly interested because I'm working to improve scikit-image support for pyopencl.Array inputs. I went back and read the original NEP and the latest iteration. Thank you again for the discussion, because the latest is indeed a vast improvement over the original.
I think the very motivation has the wrong focus. I would summarise it as "we've been coming up with all kinds of ways to do multiple dispatch for array-likes, and we've found that we need more ways, so let's come up with the One True Way." I think the focus should be on the users and community. Something along the lines of: "New implementations of array computing are cropping up left, right, and centre in Python (not to speak of other languages!). There are good reasons for this (GPUs, distributed computing, sparse data, etc), but it leaves users and library authors in a pickle: how can they ensure that their functions, written with NumPy array inputs and outputs in mind, work well in this ecosystem?"
With this new motivation in mind, I think that the user story below is (a) the best part of the NEP, but (b) underdeveloped. The NEP is all about "if I want my array implementation to work with this fancy dispatch system, what do I need to do?". But there should be more of "in user situations X, Y, and Z, what is the desired behaviour?"
The way we propose the overrides will be used by end users is::
# On the library side
import numpy.overridable as unp
def library_function(array):
    array = unp.asarray(array)
    # Code using unumpy as usual
    return array
# On the user side:
import numpy.overridable as unp
import uarray as ua
import dask.array as da
ua.register_backend(da)
library_function(dask_array) # works and returns dask_array
with unp.set_backend(da):
    library_function([1, 2, 3, 4])  # actually returns a Dask array.
Here, ``backend`` can be any compatible object defined either by NumPy or an external library, such as Dask or CuPy. Ideally, it should be the module ``dask.array`` or ``cupy`` itself.
Some questions about the above:
- What happens if I call `library_function(dask_array)` without registering `da` as a backend first? Will `unp.asarray` try to instantiate a potentially 100GB array? This seems bad. - To get `library_function`, I presumably have to do `from fancy_array_library import library_function`. Can the code in `fancy_array_library` itself register backends, and if so, should/would fancy array libraries that want to maximise compatibility pre-register a bunch of backends so that users don't have to?
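A toy model of the registration question above. The `register_backend` mechanics below are invented for illustration and are far simpler than uarray's real dispatch; the point is only that an input matching a registered backend is handed to that backend, while unregistered inputs fall through to eager coercion:

```python
# Purely illustrative registry (not the uarray API).
_registered = []

def register_backend(backend_type):
    _registered.append(backend_type)

class DaskLikeArray:
    backend_name = "dask-like"

register_backend(DaskLikeArray)

def library_function(x):
    # Without a matching registered backend, the input would be coerced
    # eagerly -- exactly the 100GB-dask-array hazard asked about above.
    for backend in _registered:
        if isinstance(x, backend):
            return "dispatched to " + backend.backend_name
    return "coerced eagerly"

hit = library_function(DaskLikeArray())
miss = library_function([1, 2, 3])
```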
Here are a couple of code snippets that I would *want* to "just work". Maybe it's unreasonable, but imho the NEP should provide these as use cases (specifically: how library_function should be written so that they work, and what dask.array and pytorch would need to do so that they work, OR, why the NEP doesn't solve them).
1. from dask import array as da
   from fancy_array_library import library_function  # hopefully skimage one day ;)

   data = da.from_zarr('myfile.zarr')
   result = library_function(data)  # result should still be dask, all things being equal
   result.to_zarr('output.zarr')
2. from dask import array as da
   from magic_library import pytorch_predict

   data = da.from_zarr('myfile.zarr')
   result = pytorch_predict(data)  # normally here I would use e.g. data.map_overlap, but could this be done magically?
   result.to_zarr('output.zarr')
There's probably a whole bunch of other "user stories" one can concoct, and no doubt many from the authors of the NEP themselves, but they don't come through in the NEP text. My apologies that I haven't read *all* the references: I understand that it is frustrating if the above are addressed there, but I think it's important to have this kind of context in the NEP itself.
Thank you again, and I hope the above is helpful rather than feels like more unnecessary churn.
Thanks Juan, this feedback is amazing and I couldn't agree more. I think we have to have this "end user focus" for this NEP, as well as for other large-scope design efforts: we should do or have done this for __array_ufunc__, __array_function__, the dtype redesign, etc.

I think in this case, the user stories and a "vision" on the whole topic don't belong inside this NEP. Rather, it should be a separate one like https://numpy.org/neps/nep-0022-ndarray-duck-typing-overview.html. If you read the first paragraph of that NEP, it actually starts out exactly right in the detailed description. But then it dives straight into design. In my experience, when you mix user stories or external requirements with design, it's extremely easy to ignore or be super brief about the former, and let design considerations/details lead rather than follow from those external requirements.

Note that this is also why I wanted to update the NEP template. We've done a tweak by adding the "Motivation and Scope" section, but that doesn't go nearly far enough.

Back to this NEP: I don't think we should significantly extend it, we should write a new separate one. Rationale: these user stories apply equally to __array_function__ et al., and will have to guide how NumPy as a whole and the usage of the NumPy API evolves over the next couple of years.

Cheers,
Ralf
Thanks to all the feedback, we have a new PR of NEP-31.
Please find the full-text quoted below:
============================================================
NEP 31 — Context-local and global overrides of the NumPy API
============================================================
:Author: Hameer Abbasi
participants (11)

- Eric Wieser
- Hameer Abbasi
- Hameer Abbasi
- Juan Nunez-Iglesias
- Nathan
- Nathaniel Smith
- Peter Bell
- Ralf Gommers
- sebastian
- Sebastian Berg
- Stephan Hoyer