big-bangs versus incremental improvements (was: Re: SciPy 2014 BoF NumPy Participation)

On Wed, Jun 4, 2014 at 7:18 AM, Travis Oliphant travis@continuum.io wrote:
Even relatively simple changes can have significant impact at this point. Nathaniel has laid out a fantastic list of great features. These are the kind of features I have been eager to see as well. This is why I have been working to fund and help explore these ideas in the Numba array object as well as in Blaze. Gnumpy, Theano, Pandas, and other projects also have useful tales to tell regarding a potential NumPy 2.0.
I think this is somewhat missing the main point of my message :-). I was specifically laying out a list of features that we could start working on *right now*, *without* waiting for the mythical "numpy 2.0".
Ultimately, I do think it is time to talk seriously about NumPy 2.0, and what it might look like. I personally think it looks a lot more like a re-write than a continuation of the modifications of Numeric that became NumPy 1.0. Right out of the gate, for example, I would make sure that NumPy 2.0 objects somehow used PyObject_VAR_HEAD so that they were variable-sized objects where the strides and dimension information was stored directly in the object structure itself instead of being allocated separately (thus requiring additional loads and stores from memory). This would be a relatively simple change. But, it can't be done and preserve ABI compatibility. It may also, at this point, have an impact on Cython code, or other code that is deeply aware of the NumPy code-structure. Some of the changes that should be made will ultimately require a porting exercise for new code --- at which point why not just use a new project?
I'm not aware of any obstacles to packing strides/dimension/data into the ndarray object right now, tomorrow if you like -- we've even discussed doing this recently in the tracker. PyObject_VAR_HEAD in particular seems... irrelevant? All it is is syntactic sugar for adding an integer field called "ob_size" to a Python object struct, plus a few macros for working with this field. We don't need or want such a field anyway (for shape/strides it would be redundant with ndim), and even if we did want such a field we could add it any time without breaking ABI. And if someday we do discover some compelling advantage to breaking ABI by rearranging the ndarray struct, then we can do this with a bit of planning by using #ifdef's to make the rearrangement coincide with a new Python release. E.g., people building against python 3.5 get the new struct layout, people building against 3.4 get the old, and in a few years we drop support for the old. No compatibility breaks needed, never mind rewrites.
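For concreteness, here's a rough sketch of what that layout question looks like in C -- illustrative only, with made-up names, not NumPy's actual PyArrayObject (and note it uses plain PyObject_HEAD, since an ob_size field would just duplicate ndim):

    #include <Python.h>

    /* Sketch only: an array object with shape and strides stored in the
     * same allocation as the object itself, so indexing needs no extra
     * pointer chase.  Not the real PyArrayObject. */
    typedef struct {
        PyObject_HEAD                    /* no ob_size: it'd be redundant with ndim */
        char *data;                      /* pointer to the element buffer */
        int ndim;                        /* number of dimensions */
        PyObject *base;                  /* owner of the buffer, or NULL */
        Py_ssize_t dims_and_strides[1];  /* really 2*ndim entries: shape, then strides */
    } inline_ndarray_sketch;

    #define SKETCH_SHAPE(a)   ((a)->dims_and_strides)
    #define SKETCH_STRIDES(a) ((a)->dims_and_strides + (a)->ndim)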
More generally: I wouldn't rule out "numpy 2.0" entirely, but we need to remember the immense costs that a rewrite-and-replace strategy will incur. Writing a new library is very expensive, so that's one cost. But that cost is nothing compared to the costs of getting that new library to the same level of maturity that numpy has already reached. And those costs, in turn, are absolutely dwarfed by the transition costs of moving the whole ecosystem from one foundation to a different, incompatible one. And probably even these costs are small compared to the opportunity costs -- all the progress that *doesn't* get made in the mean time because fragmented ecosystems suck and make writing code hard, and the best hackers are busy porting code instead of writing awesome new stuff. I'm sure dynd is great, but we have to be realistic: the hard truth is that even if it's production-ready today, that only brings us a fraction of a fraction of a percent closer to making it a real replacement for numpy.
Consider the python 2 to python 3 transition: Python 3 itself was an immense amount of work for a large number of people, with intense community scrutiny of the design. It came out in 2008. Six years and many, many improvements later, it's maybe sort-of starting to look like a plurality of users might start transitioning soonish? It'll be years yet before portable libraries can start taking advantage of python 3's new awesomeness. And in the meantime, the progress of the whole Python ecosystem has been seriously disrupted: think of how much awesome stuff we'd have if all the time that's been spent porting and testing different packages had been spent on moving them forward instead. We also have experience closer to home -- did anyone enjoy the numeric/numarray->numpy transition so much they want to do it again? And numpy will be much harder to replace than numeric -- numeric wasn't the most-imported package in the pythonverse ;-). And my biggest worry is that if anyone even tries to convince everyone to make this kind of transition, then, if they're successful at all, they'll create a substantial period where the ecosystem is a big incompatible mess (and they might still eventually fail, providing no long-term benefit to make up for the immediate costs). This scenario is a nightmare for end-users all around.
By comparison, if we improve numpy incrementally, then we can in most cases preserve compatibility totally, and in the rare cases where it's necessary to break something we can do it mindfully, minimally, and with a managed transition. (Downstream packages are already used to handling a few limited API changes at a time, it's not that hard to support both APIs during the transition period, etc., so this way we bring the ecosystem with us.) Every incremental improvement to numpy immediately benefits its immense user base, and gets feedback and testing from that immense user base. And if we incrementally improve interoperability between numpy and other libraries like dynd, then instead of creating fragmentation, it will let downstream packages use both in a complementary way, switching back and forth depending on which provides more utility on a case-by-case basis. If this means that numpy eventually withers away because users vote with their feet, then great, that'd be compelling evidence that whatever they were migrating to really is better, which I trust a lot more than any guesses we make on a mailing list. The gradual approach does require that we be grown-ups and hold our noses while refactoring out legacy spaghetti and writing unaesthetic compatibility hacks. But if you compare this to the alternative... the benefits of incrementalism are, IMO, overwhelming.
The only exception is when two specific criteria are met: (1) there are changes that are absolutely necessary for the ecosystem's long-term health (e.g., py3's unicode-for-mere-mortals and true division), AND (2) it's absolutely impossible to make these changes incrementally (unicode and true division first entered Python in 2000 and 2001, respectively, and immense effort went into finding the smoothest transition, so it's pretty clear that as painful as py3 has been, there isn't really anything better).
What features could meet these two criteria in numpy's case? If I were the numpy ecosystem and you tried to convince me to suffer through a big-bang transition for the sake of PyObject_VAR_HEAD then I think I'd be kinda unconvinced. And it only took me a few minutes to rattle off a whole list of incremental changes that haven't even been tried yet.
-n

Believe me, I'm all for incremental changes if they are actually possible and don't actually cost more. It's also why I've been silent until now about anything we are doing being a candidate for a NumPy 2.0. I understand the challenges of getting people to change. But, features and solid improvements *will* get people to change --- especially if their new library can be used along with the old library and the transition can be done gradually. Python 3's struggle is the lack of features.
At some point there *will* be a NumPy 2.0. What features go into NumPy 2.0, how much backward compatibility is provided, and how much porting is needed to move your code from NumPy 1.X to NumPy 2.X is the real user question --- not whether it is characterized as "incremental" change or "re-write". What I call a re-write and what you call an "incremental-change" are two points on a spectrum and likely overlap significantly if we really compared what we are thinking about.
One huge benefit that came out of the numeric / numarray / numpy transition that we mustn't forget about was actually the extended buffer protocol and memory view objects. This really does allow multiple array objects to co-exist and libraries to use the object that they prefer in a way that did not exist when Numarray / numeric / numpy came out. So, we shouldn't be afraid of that world. The existence of easy package managers to update environments to try out new features and have applications on a single system that use multiple versions of the same library is also something that didn't exist before and that will make any transition easier for users.
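As a rough illustration of why that matters (a sketch, not code from any particular project): a C extension can consume *any* array object that exports the PEP 3118 buffer protocol -- an ndarray, a memoryview, array.array, or a third-party array type -- without linking against NumPy at all. Error handling is abbreviated.

    #include <Python.h>
    #include <stdio.h>

    /* Inspect shape/strides/format of any buffer-protocol exporter. */
    static int
    print_buffer_info(PyObject *obj)
    {
        Py_buffer view;
        if (PyObject_GetBuffer(obj, &view, PyBUF_FULL_RO) < 0) {
            return -1;  /* object does not export the buffer protocol */
        }
        printf("itemsize=%zd ndim=%d format=%s\n",
               view.itemsize, view.ndim,
               view.format ? view.format : "B");
        for (int i = 0; i < view.ndim; i++) {
            printf("  dim %d: shape=%zd stride=%zd\n",
                   i, view.shape[i], view.strides[i]);
        }
        PyBuffer_Release(&view);
        return 0;
    }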
One thing I regret about my original work on NumPy is that I didn't have the foresight, skill, and understanding to work on a more extensive and better-designed multiple-dispatch system so that multiple array objects could participate together in an expression flow. The __numpy_ufunc__ mechanism gives enough capability in that direction that things may be better now.
Ultimately, I don't disagree that NumPy can continue to exist in "incremental" change mode (though if you are swapping out whole swaths of C code for Cython code --- it sounds a lot like a "re-write") as long as there are people willing to put the effort into changing it. I think this is actually benefited by the existence of other array objects that are pushing the feature envelope without the constraints --- in much the same way that the Python standard library is benefited by many versions of different capabilities being tried out before moving into the standard library.
I remain optimistic that things will continue to improve in multiple ways --- if a little "messier" than any of us would conceive individually. It *is* great to see all the PRs coming from multiple people on NumPy and all the new energy around improving things, whether great or small.
Best,
-Travis

On Wed, Jun 4, 2014 at 7:29 PM, Travis Oliphant travis@continuum.io wrote:
[...]
@nathaniel IIRC, one of the objections to the missing values work was that it changed the underlying array object by adding a couple of variables to the structure. I'm willing to do that sort of thing, but it would be good to have general agreement that that is acceptable.
As to blaze/dynd, I'd like to steal bits here and there, and maybe in the long term base numpy on top of it with a compatibility layer. There is a lot of thought and effort that has gone into those projects and we should use what we can. As is, I think numpy is good for another five to ten years and will probably hang on for fifteen, but it will be outdated by the end of that period. Like great whites, we need to keep swimming just to have oxygen. Software projects tend to be obligate ram ventilators.
The Python 3 experience is definitely something we want to avoid. And while blaze does big data and offers some nice features, I don't know that it offers compelling reasons for the more ordinary user to upgrade at this time, so I'd like to sort of slip it into numpy if possible.
If we do start moving numpy forward in more radical steps, we should try to have some agreement beforehand as to what sort of changes are acceptable. For instance, to maintain backward compatibility, is it sufficient that a recompile will do the job, or do we require forward compatibility for extensions compiled against earlier releases? Do we stay with C or should we support C++ code with its advantages of smart pointers, exception handling, and templates? We will need a certain amount of flexibility going forward and we should decide, or at least discuss, such issues up front.
Chuck

On Thu, Jun 5, 2014 at 3:36 AM, Charles R Harris charlesr.harris@gmail.com wrote:
[...]
@nathaniel IIRC, one of the objections to the missing values work was that it changed the underlying array object by adding a couple of variables to the structure. I'm willing to do that sort of thing, but it would be good to have general agreement that that is acceptable.
I think changing the ABI for some versions of numpy (2.0, whatever) is acceptable. There is little doubt that the ABI will need to change to accommodate a better and more flexible architecture.
Changing the C API is more tricky: I am not up to date on the changes from the last 2-3 years, but at that time most things could have been changed internally without breaking much, though I did not go far enough to estimate what the performance impact could be (if any).
[...]
If we do start moving numpy forward in more radical steps, we should try to have some agreement beforehand as to what sort of changes are acceptable. For instance, to maintain backward compatibility, is it sufficient that a recompile will do the job, or do we require forward compatibility for extensions compiled against earlier releases? Do we stay with C or should we support C++ code with its advantages of smart pointers, exception handling, and templates? We will need a certain amount of flexibility going forward and we should decide, or at least discuss, such issues up front.
Last time the C++ discussion was brought up, no consensus could be reached. I think quite a few radical changes can be made without that consensus already, though others may disagree there.
IMO, what is needed most is refactoring the internals to extract the low-level Python C API layer from the rest of the code, as I think that's the main bottleneck to getting more contributors (or getting new core features more quickly).
David

On Thu, Jun 5, 2014 at 8:40 AM, David Cournapeau cournape@gmail.com wrote:
[...]
I think changing the ABI for some versions of numpy (2.0, whatever) is acceptable. There is little doubt that the ABI will need to change to accommodate a better and more flexible architecture.
Changing the C API is more tricky: I am not up to date on the changes from the last 2-3 years, but at that time most things could have been changed internally without breaking much, though I did not go far enough to estimate what the performance impact could be (if any).
My impression is that you can do it once (in a while) so that no more than two incompatible versions of numpy are alive at the same time.
It doesn't look worse to me than supporting a new python version, but doubles the number of binaries and wheels.
(Supporting python 3.4 for cython-based projects mostly meant hoping that cython would take care of it. And the cython developers did take care of it.)
Josef

On Thu, Jun 5, 2014 at 6:40 AM, David Cournapeau cournape@gmail.com wrote:
[...]
What do you mean by "extract the Python C API"?
Chuck

On Thu, Jun 5, 2014 at 2:51 PM, Charles R Harris charlesr.harris@gmail.com wrote:
[...]
What do you mean by "extract the Python C API"?
Poor choice of words: I meant extracting the lower-level part of array/ufunc/etc. from its wrapping in the python C API (with the idea that the latter could be done in Cython, modulo improvements in cython to manage the binary/code size explosion).
IOW, split numpy into core and core-py (I think dynd benefits a lot from that, on top of its feature set).
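As a hypothetical sketch of the kind of split I mean (names and layout are made up, not a proposal): the "core" part is plain C that never includes Python.h, and the "core-py" layer -- which could be Cython -- does nothing but unpack Python objects into such a descriptor and call down, so the core side can be compiled and tested without CPython.

    #include <stddef.h>

    /* Hypothetical Python-free array descriptor used by "core". */
    typedef struct {
        void *data;
        int ndim;
        ptrdiff_t *shape;     /* in elements */
        ptrdiff_t *strides;   /* in bytes */
        size_t itemsize;
    } core_array;

    /* A strided reduction written against raw buffers only
     * (assumes a 1-d float64 array for brevity). */
    double core_sum_double(const core_array *a)
    {
        double acc = 0.0;
        const char *p = a->data;
        for (ptrdiff_t i = 0; i < a->shape[0]; i++, p += a->strides[0]) {
            acc += *(const double *)p;
        }
        return acc;
    }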
David

On Thu, Jun 5, 2014 at 3:24 PM, David Cournapeau cournape@gmail.com wrote:
[...]
Can you give some examples of these benefits? I'm kinda wary of refactoring-for-the-sake-of-it -- IME usually it's easier, more valuable, and more fun to refactor in the process of making some concrete improvement.
Also, it's very much pie-in-the-sky at the moment, but if the pypy or numba or pyston compilers gained the ability to grok cython code directly, then having everything in cython instead of C could potentially allow for a single numpy code base to be shared between cpython and jitted-python, with the former working as it does now and the latter doing JIT loop fusion etc.
-n

On Thu, Jun 5, 2014 at 11:48 PM, Nathaniel Smith njs@pobox.com wrote:
[...]
Can you give some examples of these benefits?
numpy.core is difficult to approach as a codebase: it is big, and quite entangled. While the concepts are sound, there is not much internal architecture. I would love for numpy to have a proper internal C API. I'd like to think my effort of splitting multiarray's giant .c files into multiple files somehow made everybody's life easier.
A lot of the current code is python C API, which nobody cares about, and could be handled by e.g. cython (although again the feasibility of that needs to be discussed with the cython team, as cython cannot realistically be used pervasively for numpy ATM, see e.g. http://mail.scipy.org/pipermail/scipy-dev/2012-July/017717.html).
Such a separation would also be helpful for the pie-in-the-sky projects you mentioned. I think Wes McKinney and the pandas team's experience with using cython for the core vs. just for the C<->Python integration would be useful there as well.
I'm kinda wary of refactoring-for-the-sake-of-it -- IME usually it's easier, more valuable, and more fun to refactor in the process of making some concrete improvement.
Sure, I am just suggesting there should be a conscious effort to not just add features but also think about the internal consistency.
One concrete example is dtype and its pluggability: I would love to see things like datetime in a separate extension. It would keep us honest to allow people to create custom dtypes (last time I checked, there was some hardcoding that made it hard to do).
David

On Thu, Jun 5, 2014 at 3:36 AM, Charles R Harris charlesr.harris@gmail.com wrote:
@nathaniel IIRC, one of the objections to the missing values work was that it changed the underlying array object by adding a couple of variables to the structure. I'm willing to do that sort of thing, but it would be good to have general agreement that that is acceptable.
I can't think of a reason why adding new variables to the structure *per se* would be objectionable to anyone? IIRC the objection you're thinking of wasn't to the existence of new variables, but to their effect on compatibility: their semantics meant that every piece of legacy C code that worked with ndarrays had to be updated to check for the new variables before it could correctly interpret the ->data field, and if it wasn't updated it would just return incorrect results. And there wasn't really a clear story for how we were going to detect and fix all this legacy code. This specific kind of compatibility break does seem pretty objectionable, but that's because of the details of its behaviour, not because new variables in general are problematic, I think.
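To make that failure mode concrete, here is a schematic sketch (not code from any real package, and the "mask" is just a stand-in for the actual NA design) of the kind of legacy code that was at risk:

    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* Legacy code written before any notion of a mask existed; assumes a
     * contiguous float64 array for brevity. */
    static double
    legacy_sum(PyArrayObject *arr)
    {
        double *buf = (double *)PyArray_DATA(arr);
        npy_intp n = PyArray_SIZE(arr);
        double acc = 0.0;
        for (npy_intp i = 0; i < n; i++) {
            /* If a new struct field (e.g. a mask) changes which entries of
             * ->data count as valid, this keeps summing all of them and
             * silently returns a wrong answer -- no error, no crash. */
            acc += buf[i];
        }
        return acc;
    }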
As to blaze/dynd, I'd like to steal bits here and there, and maybe in the long term base numpy on top of it with a compatibility layer. There is a lot of thought and effort that has gone into those projects and we should use what we can. As is, I think numpy is good for another five to ten years and will probably hang on for fifteen, but it will be outdated by the end of that period. Like great whites, we need to keep swimming just to have oxygen. Software projects tend to be obligate ram ventilators.
I worry a bit that this could become a self-fulfilling prophecy. Plenty of software survives longer than that; the Linux kernel hasn't had a "real" major number increase [1] since 2.6.0, more than 10 years ago, and it's still an extraordinarily vital project. Partly this is because they have resources we don't etc., but partly it's just because they've decided that incremental change is how they're going to do things, and approached each new feature with that in mind. And in ten years they haven't yet found any features that required a serious compatibility break.
This is a pretty minor worry though -- we don't have to agree about what will happen in 10 years to agree about what to do now :-).
[1] http://www.pcmag.com/article2/0,2817,2388926,00.asp
The Python 3 experience is definitely something we want to avoid. And while blaze does big data and offers some nice features, I don't know that it offers compelling reasons for the more ordinary user to upgrade at this time, so I'd like to sort of slip it into numpy if possible.
If we do start moving numpy forward in more radical steps, we should try to have some agreement beforehand as to what sort of changes are acceptable. For instance, to maintain backward compatibility, is it sufficient that a recompile will do the job, or do we require forward compatibility for extensions compiled against earlier releases?
I find it hard to discuss these things in general, since specific compatibility issues usually involve complicated trade-offs -- will every package have to recompile or just some of them, if they don't will it be a nice error message or a segfault, is there some way we can issue warnings ahead of time for the offending behaviour, etc. etc.
That said, my vote is that if there's a change that (a) can't be done some other way, (b) requires a recompile, (c) doesn't cause segfaults but rather produces some sensible error message like "ABI mismatch please recompile", (d) is a change that's worth the bother (this determination to include at least canvassing the list to check that users in general agree that it's worth it), then yeah we should do it. I don't anticipate that this will happen very often given how far we've gotten without it, but yeah.
It's possible we should be making a fuss now on distutils-sig about handling these cases in the brave new world of wheels, so that the relevant features have some chance of existing by the time we need them (e.g., 'pip upgrade numpy' should become smart enough to detect when this necessitates an upgrade of scipy).
Do we stay with C or should we support C++ code with its advantages of smart pointers, exception handling, and templates? We will need a certain amount of flexibility going forward and we should decide, or at least discuss, such issues up front.
This is an easier question, since it doesn't affect end-users at all (at least, so long as they have a decent toolchain available, but scipy already requires C++). Personally I'd like to see a more concrete plan for how exactly C++ would be used and why it's better than alternatives (as mentioned I have the vague idea that Cython would be even better), but I can't see why we should rule it out up front either.
-n

On Thu, Jun 5, 2014 at 11:41 PM, Nathaniel Smith njs@pobox.com wrote:
[...]
That said, my vote is that if there's a change that (a) can't be done some other way, (b) requires a recompile, (c) doesn't cause segfaults but rather produces some sensible error message like "ABI mismatch please recompile", (d) is a change that's worth the bother (this determination to include at least canvassing the list to check that users in general agree that it's worth it), then yeah we should do it. I don't anticipate that this will happen very often given how far we've gotten without it, but yeah.
Changing the ABI 'safely' (i.e., raising a python exception on mismatch instead of crashing) is already handled in numpy. We can always increase the ABI version if we think it is worth it.
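For reference, that check is the one every compiled extension already picks up through import_array() in its init function -- a minimal python-3-style sketch below, with a made-up module name:

    #include <Python.h>
    #include <numpy/arrayobject.h>

    static struct PyModuleDef abicheck_demo_module = {
        PyModuleDef_HEAD_INIT, "abicheck_demo", NULL, -1, NULL
    };

    PyMODINIT_FUNC
    PyInit_abicheck_demo(void)
    {
        /* import_array() compares the ABI version this module was compiled
         * against with the one exported by the installed numpy, and on
         * mismatch sets ImportError and returns NULL instead of letting
         * mismatched binaries run and crash. */
        import_array();
        return PyModule_Create(&abicheck_demo_module);
    }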
David

On Thu, Jun 5, 2014 at 2:29 AM, Travis Oliphant travis@continuum.io wrote:
At some point there *will* be a NumPy 2.0. What features go into NumPy 2.0, how much backward compatibility is provided, and how much porting is needed to move your code from NumPy 1.X to NumPy 2.X is the real user question --- not whether it is characterized as "incremental" change or "re-write".
There may or may not ever be a numpy 2.0. Maybe there will be a numpy 1.20 instead. Obviously there will be changes, and I think we generally agree on the end goal, but the question is how we get from here to there.
What I call a re-write and what you call an "incremental-change" are two points on a spectrum and likely overlap significantly if we really compared what we are thinking about.
[...]
Ultimately, I don't disagree that NumPy can continue to exist in "incremental" change mode (though if you are swapping out whole swaths of C code for Cython code --- it sounds a lot like a "re-write") as long as there are people willing to put the effort into changing it.
This is why I'm trying to emphasize the contrast between big-bang versus incremental, rather than rewrite-versus-not-rewrite. If Theseus goes through replacing every timber in his ship, and does it one at a time, then the ship still floats. If he tries to do it all at once, then the end goal may be the same but the actual results are rather different.
And perception matters. If we set out to design "numpy 2.0" then that conversation will go one way. If we set out to design "numpy 1.20", then the conversation will be different. I want to convince people that the numpy 1.20 approach is a worthwhile place to put our efforts.
I think this is actually benefited by the existence of other array objects that are pushing the feature envelope without the constraints --- in much the same way that the Python standard library is benefited by many versions of different capabilities being tried out before moving into the standard library.
Indeed!
-n

On 5 Jun 2014 02:57, "Nathaniel Smith" njs@pobox.com wrote:
On Wed, Jun 4, 2014 at 7:18 AM, Travis Oliphant travis@continuum.io wrote:
And numpy will be much harder to replace than numeric -- numeric wasn't the most-imported package in the pythonverse ;-).
If numpy is really such a core part of the python ecosystem, does it really make sense to keep it as a stand-alone package? Rather than thinking about a numpy 2, might it be better to focus on getting ndarray and dtype to a level of quality where acceptance upstream might be plausible?
Matlab and python are no longer the only games in town for scientific computing. I worry that the lack of multidimensional array literals, not to mention the lack of built-in multidimensional arrays at all, can only hurt python's attractiveness compared to languages like Julia long-term.
For those of us who already know and love python, this doesn't bother us much, if at all. But thinking of attracting new users long-term, I worry that it will be harder to convince outsiders that python is really a first-class scientific computing language when it lacks the key data type for scientific computing.

On Thu, Jun 5, 2014 at 9:44 AM, Todd toddrjen@gmail.com wrote:
[...]
There were discussions about integrating numpy a long time ago (can't find a reference right now), and the consensus was that this was neither possible in its current shape nor advisable. The situation has not changed.
Putting something in the stdlib means it basically cannot change anymore: API compatibility requirements would be stronger than what we provide even now. NumPy is also a large codebase which would need some major clean up to be accepted, etc...
David

On 5 Jun 2014 14:28, "David Cournapeau" cournape@gmail.com wrote:
[...]
I am not suggesting merging all of numpy, only ndarray and dtype (which I know is a huge job in itself). And perhaps not even all of what is currently included in those; some methods could be split out into their own functions.
And any numpy 2.0 would also imply a major code cleanup. So although ndarray and dtype are certainly not ready for such a thing right now, if you are talking about numpy 2.0 already, perhaps part of that discussion could involve a plan to get the code into a state where such a move might be plausible. Even if the merge doesn't actually happen, having the code at that quality level would still be a good thing.
I agree that the relationship between numpy and python has not changed very much in the last few years, but I think the scientific computing landscape is changing. The latter issue is where my primary concern lies.

On Thu, Jun 5, 2014 at 1:58 PM, Todd toddrjen@gmail.com wrote:
[...]
I am not suggesting merging all of numpy, only ndarray and dtype (which I know is a huge job in itself). And perhaps not even all of what is currently included in those; some methods could be split out into their own functions.
That is what was discussed and rejected in favor of putting the enhanced buffer protocol into the language.

On Thu, Jun 5, 2014 at 8:58 AM, Todd toddrjen@gmail.com wrote:
[...]
I agree that the relationship between numpy and python has not changed very much in the last few years, but I think the scientific computing landscape is changing. The latter issue is where my primary concern lies.
I don't think it would have any effect on scientific computing users. It might be useful for other users who occasionally want to do a bit of array processing.
Scientific users need the extended SciPy stack, not a part of numpy that can be imported from the standard library. For example in "Data Science", where I pay more attention and where Python is getting pretty popular, the usual recommended list requires numpy, scipy, and 5 to 10 more python libraries.
Should pandas also go into the python standard library? Python 3.4 got a statistics library, but I don't think it has any effect on the potential statsmodels user base.
Josef
participants (7)
- Charles R Harris
- David Cournapeau
- josef.pktd@gmail.com
- Nathaniel Smith
- Robert Kern
- Todd
- Travis Oliphant