The future of ndarray.diagonal()

A discussion [1] is currently underway at GitHub which will benefit from a larger forum. In version 1.9, the diagonal() method was changed to return a read-only (non-contiguous) view into the original array instead of a plain copy. It has also been announced [2] that in 1.10 the view will become read/write. A concern has now been raised [3] that this change breaks backward compatibility too much. Consider the following code:

    x = numpy.eye(2)
    d = x.diagonal()
    d[0] = 2

In 1.8, this code runs without errors and results in [2, 1] stored in array d. In 1.9, this is an error. With the current plan, in 1.10 this will become valid again, but the result will be different: x[0,0] will be 2 while it is 1 in 1.8.

Two alternatives are suggested for discussion:

1. Add a copy=True flag to the diagonal() method.
2. Roll back the 1.9 change to diagonal() and introduce an additional diagonal_view() method that returns a view.

[1] https://github.com/numpy/numpy/pull/5409
[2] http://docs.scipy.org/doc/numpy/reference/generated/numpy.diagonal.html
[3] http://khinsen.wordpress.com/2014/09/12/the-state-of-numpy/
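A minimal sketch of the forward-compatible workaround (not part of the original post): taking an explicit copy behaves the same on 1.8, 1.9, and the planned 1.10.

    import numpy as np

    x = np.eye(2)
    d = x.diagonal().copy()   # explicit copy: always writeable and independent of x
    d[0] = 2                  # modifies only the copy; x[0, 0] stays 1 on every version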

On 1 Jan 2015 21:35, "Alexander Belopolsky" <ndarray@mac.com> wrote:
A discussion [1] is currently underway at GitHub which will benefit from
a larger forum.
In version 1.9, the diagonal() method was changed to return a read-only
(non-contiguous) view into the original array instead of a plain copy. Also, it has been announced [2] that in 1.10 the view will become read/write.
A concern has now been raised [3] that this change breaks backward
compatibility too much.
In 1.8, this code runs without errors and results in [2, 1] stored in array d. In 1.9, this is an error. With the current plan, in 1.10 this will become valid again, but the result will be different: x[0,0] will be 2 while it is 1 in 1.8.

Further context: in 1.7 and 1.8, the code above works as described, but writing to the returned array also issues a visible-by-default FutureWarning.
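A small illustrative sketch (not from the original message) of how that warning can be observed on 1.7/1.8; on 1.9 the same write raises an error instead:

    import warnings
    import numpy as np

    x = np.eye(2)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        d = x.diagonal()
        d[0] = 2     # on 1.7/1.8 this works but emits a FutureWarning
    print([type(w.message).__name__ for w in caught])   # expect ['FutureWarning'] on 1.7/1.8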
1.7 was released in Feb. 2013, ~22 months ago. (I'm not implying this number is particularly large or small, it's just something that I find useful to calculate when thinking about things like this.)

The choice of "1.10" as the target for completing this change is more-or-less a strawman and we shouldn't feel bound by it. The schedule was originally written in between the 1.6 and 1.7 releases, when our release process was kinda broken and we had no idea what the future release schedule would look like (1.6 -> 1.7 ultimately ended up being a ~21 month gap). We've already adjusted the schedule for this deprecation once before (see issue #596: The original schedule called for the change to returning a ro-view to happen in 1.8, rather than 1.9 as it actually did). Now that our release frequency is higher, 1.11 might well be a more reasonable target than 1.10.

As for the overall question, this is really a bigger question about what strategy we should use in general to balance between conservatism (which is a Good Thing) and making improvements (which is also a Good Thing). The post you cite brings this up explicitly:
[3] http://khinsen.wordpress.com/2014/09/12/the-state-of-numpy/
I have huge respect for the problems and pain that Konrad describes in this blog post, but I really can't agree with the argument or the conclusions. His conclusion is that when it comes to compatibility breaks, slow-incremental-change is bad, and that we should instead prefer big all-at-once compatibility breaks like the Numeric->Numpy or Py2->Py3 transitions. But when describing his own experiences that he uses to motivate this, he says:

*"The two main dependencies of my code, NumPy and Python itself, did sometimes introduce incompatible changes (by design or as consequences of bug fixes) that required changes on my own code base, but they were surprisingly minor and never required more than about a day of work."*

i.e., slow-incremental-change has actually worked well in his experience. (And in particular, the np.diagonal issue only comes in as an example to illustrate what he means by the phrase "slow continuous change" -- this particular change hasn't actually broken anything in his code.)

OTOH the big problem that motivated his post was that his code is all written against the APIs of the ancient and long-abandoned Numeric project, and he finds the costs of transitioning them to the "new" numpy APIs to be prohibitively expensive, i.e. this big-bang transition broke his code. (It did manage to limp on for some years b/c numpy used to contain some compatibility code to emulate the Numeric API, but this doesn't really change the basic situation: there were two implementations of the API he needed -- numpy.oldnumeric and Numeric itself -- and both implementations still exist in the sense that you can download them, but neither is usable because no-one's willing to maintain them anymore.)

Maybe I'm missing something, but his data seems to be pi radians off from his conclusion.

-n

Wasn't all of this discussed way back when the deprecation plan was made? This was known to happen and was entirely the intent, right? What new argument is there to deviate from the plan?

As for that particular blog post, I remember reading it back when it was posted. I, again, sympathize with the author's plight, but I pointed out that the reason for some of the changes he noted was that they could cause bugs, which would mean that results could be wrong. Reproducibility is nigh useless without a test suite to ensure the component parts are reproducible on their own.

OTOH, there is an argument for slow, carefully considered changes to APIs (which I think the diagonal() changes were). An example of a potentially poor change is in matplotlib. We are starting to move to using properties, away from getters/setters. In my upcoming book, I ran into a problem where I needed to use an Artist's get_axes() or its "axes" property, but there will only be one release of matplotlib where both of them are valid. I was faced with either using get_axes() and having my code become obsolete sometime in the summer, using the property and having my code invalid for all but the most recent version of matplotlib, or adding some version-checking code that would distract from the lesson at hand. I now think that a single release cycle for the deprecation of get_axes() was not a wise decision, especially since the old code was merely verbose, not buggy.

To conclude, unless someone can present a *new* argument to deviate from the diagonal() plan that was set a couple of years ago, I don't see any reason why the decisions that were agreed upon then are invalid now. The pros and cons were weighed, and this particular con was known then and was considered acceptable at that time.

Cheers!
Ben Root

On Sat, Jan 3, 2015 at 2:49 PM, Nathaniel Smith <njs@pobox.com> wrote:

On 03/01/15 20:49, Nathaniel Smith wrote:
There are two different scenarios to consider here, and perhaps I didn't make that distinction clear enough. One scenario is that of a maintained library or application that depends on NumPy. The other scenario is a set of scripts written for a specific study (let's say a thesis) that is then published for its documentation value. Such scripts are in general not maintained.

In the first scenario, gradual API changes work reasonably well, as long as the effort involved in applying the fixes is sufficiently minor that developers can integrate them into their routine maintenance efforts. That is the situation I have described for my own past experience as a library author.

It's the second scenario where gradual changes are a real problem. Suppose I have a set of scripts from a thesis published in year X, and I need to understand them in detail in year X+5 for a related scientific project. If the script produces different results with NumPy 1.7 and NumPy 1.10, which result should I assume the author intended? People rarely write down which versions of all dependencies they used. Yes, they should, but it's actually a pain to do this, in particular when you work on multiple machines and don't manage the Python installation yourself. In this rather frequent situation, the published scripts are ambiguous - I can't really know what they did when the author ran them.

There is a third scenario where this problem shows up: outdated legacy system installations, which are particularly frequent on clusters and supercomputers. For example, the cluster at my lab runs a CentOS version that is a few years old. CentOS is known for its conservatism, and therefore the default Python installation on that machine is based on Python 2.6 with correspondingly old NumPy versions. People do install recent application libraries there. Suppose someone runs code there that assumes the future semantics of diagonal() - this will silently yield wrong results.

In summary, my point of view on breaking changes is:

1) Changes that can make legacy code fail can be introduced gradually. The right compromise between stability and progress must be figured out by the community.

2) Changes that yield different results for unmodified legacy code should never be allowed.

3) The best overall solution for API evolution is a version number visible in client code, with a version number change whenever some breaking change is introduced. This guarantees point 2).

So if the community decides that it is important to change the behavior of diagonal(), this should be done in one of two ways:

a) Deprecate diagonal() and introduce a differently-named method with the new functionality. This will make old code fail rather than produce wrong results.

b) Accumulate this change with other such changes and call the new API "numpy2".

Konrad.

On 04/01/15 17:22, Konrad Hinsen wrote:
It's the second scenario where gradual changes are a real problem.
A scientific paper or thesis should be written so it is completely reproducible. That would include describing the computer, OS, Python version and NumPy version, as well as the C or Fortran compiler. I will happily fail any student who writes a thesis without providing such details, and if I review a research paper for a journal you can be sure I will ask that it be corrected.

Sturla

On 04/01/15 21:28, Sturla Molden wrote:
I completely agree and we should all work towards this goal. But we aren't there yet. Most of the scientific community is just beginning to realize that there is a problem. Anyone writing scientific software for use in today's environment has to take this into account.

More importantly, there is not only the technical problem of reproducibility, but also the meta-level problem of human understanding. Scientific communication depends more and more on scripts as the only precise documentation of a computational method. Our programming languages are becoming a major form of scientific notation, alongside traditional mathematics. Humans don't read written text with version numbers in mind. This is a vast problem which can't be solved merely by "fixing" software technology, but it's something to keep in mind nevertheless when writing software.

For those interested in this aspect, I have written a much more detailed account in a recent paper: http://dx.doi.org/10.12688/f1000research.3978.2

Konrad.

Konrad Hinsen <konrad.hinsen@fastmail.net> wrote:
To me it seems that algorithms in scientific papers and books are described in various forms of pseudo-code. Perhaps we need a notation which is universal and eternal, like the language of mathematics. But I am not sure Python could or should try to be that "scripting" language. I also think it is reasonable to ask whether journals should require code submitted as algorithmic documentation to be written in some ISO-standard language like C or Fortran 90. The behavior of Python and NumPy is not dictated by standards, and as such is no better than pseudo-code.

Sturla

On 5 January 2015 08:43:45 +0000 Sturla Molden <sturla.molden@gmail.com> wrote:
To me it seems that algorithms in scientific papers and books are described in various forms of pseudo-code.
That's indeed what people do when they write a paper about an algorithm. But many if not most algorithms in computational science are never published in a specific article. Very often, a scientific article gives only an outline of a method in plain English. The only full documentation of the method is the implementation.
Neither Python nor any other programming language was designed for that task, and none of them is really a good fit. But today's de facto situation is that programming languages fulfill the role of algorithmic specification languages in computational science. And I don't expect this to change rapidly, in particular because to the best of my knowledge there is no better choice available at the moment.

I wrote an article on this topic that will appear in the March 2015 issue of "Computing in Science and Engineering". It concludes that for now, a simple Python script is probably the best you can do for an executable specification of an algorithm. However, I also recommend not using big libraries such as NumPy in such scripts.
True, but the ISO specifications of C and Fortran have so many holes ("undefined behavior") that they are not really much better for the job. And again, we can't ignore the reality of the de facto use today: there are no such requirements or even guidelines, so Python scripts are often the best we have as algorithmic documentation.

Konrad.

On Mon, Jan 5, 2015 at 4:08 AM, Konrad Hinsen <konrad.hinsen@fastmail.net> wrote:
Matlab is more "well defined" than numpy. numpy has too many features.

I think, if you want a runnable python script as algorithmic documentation, then it will be necessary and relatively easy in most cases to stick to the "stable" basic features. The same goes for a library: if we want to minimize compatibility problems, then we shouldn't use features that are most likely a moving target.

One of the issues is whether we want to write "safe" or "fancy" code. (Fancy code might, or will, be faster with a specific version.) For example, in most of my use cases having a view or a copy of an array makes a difference to the performance but not the results.

I didn't participate in the `diagonal` debate because I don't have a strong opinion and don't use it with an assignment. There is an explicit np.fill_diagonal that is in-place (illustrated below). Whether arrays are views or copies never sounded like it had a clear-cut answer; there are too many functions that "return a view if possible". When the correctness of our (statsmodels) code depends on whether something is a view or a copy, we usually make sure and write the matching unit tests. In other cases, the behavior of numpy in edge cases like empty arrays is still in flux; we usually try to avoid relying on implicit behavior.

Dtypes are a mess (in terms of code compatibility). Matlab is much nicer, it's all just doubles. Now pandas and numpy are making object arrays popular and introducing strange things like datetime dtypes, and users think a program written a while ago can handle them.

A related compatibility issue is Python 2 versus Python 3: for non-string-manipulation scientific code the main limitation is to avoid version-specific features and to decide when to use lists versus iterators for range, zip, map. Other than that, it looks much simpler to me than expected.

Overall I think the current policy of incremental changes in numpy works very well. Statsmodels needs a few minor adjustments in each version, but most of those are for cases where numpy became more strict or where we used a specific behavior in edge cases, AFAIR.

One problem with accumulating changes for a larger version change like numpy 2 or 3 or 4 is deciding which changes would require this. Most changes will break some code, if the code requires or uses some exotic or internal behavior. If we want to be strict, then we don't change the policy but change the version numbers: instead of 1.8 and 1.9 we have numpy 18 and numpy 19. However, from my perspective none of the recent changes were fundamental enough.

BTW: Stata versions its scripts. Each script can define for which version of Stata it was written, but I have no idea how they handle the compatibility issues. It looks to me like it would be way too much work to do something like this in an open-source project.

Legacy cleanups like the removal of Numeric compatibility in numpy, or of weave (and maxentropy) in scipy, have been announced for a long time, and eventually all legacy code needs to run in a legacy environment. But that's a different issue from developing numpy and the current scientific-python-related packages, which need the improvements. It is always possible just to "freeze" a package, with its own frozen Python and frozen versions of dependencies.

Josef
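For reference, a minimal sketch of the explicit in-place alternative mentioned above (an illustration, not part of the original message):

    import numpy as np

    x = np.eye(2)
    np.fill_diagonal(x, 2.0)   # writes the diagonal of x in place, independent of
                               # whether diagonal() returns a copy or a view
    # x is now [[2., 0.], [0., 2.]]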

On 1/5/2015 10:48 AM, josef.pktd@gmail.com wrote:
Dtypes are a mess (in terms of code compatibility). Matlab is much nicer, it's all just doubles.
1. Thank goodness for dtypes.
2. http://www.mathworks.com/help/matlab/numeric-types.html
3. After translating Matlab code to much nicer NumPy, I cannot find any way to say MATLAB is "nicer".

Cheers,
Alan

On Mon, Jan 5, 2015 at 11:13 AM, Alan G Isaac <alan.isaac@gmail.com> wrote:
Maybe it's my selection bias in Matlab: I only wrote or read Matlab code that used exclusively doubles.

Of course dtypes are a necessary and great feature. However, life and code would be simpler if we could just do x = np.asarray(x, float) or even x = np.array(x, float) at the beginning of every function, instead of worrying why a user doesn't have float and trying to accommodate that choice.

https://github.com/statsmodels/statsmodels/search?q=dtype&type=Issues&utf8=%E2%9C%93

AFAIK, Matlab and R still have copy-on-write, so they don't have to worry about in-place modifications. 5 lines of code to implement an algorithm, and 50 lines of code for input checking.

My response was to the issue of code as algorithmic documentation: there are packages or code supplements to books that come with the disclaimer that the code is written for educational purposes, to help understand the algorithm, but is not designed for efficiency or performance or generality. The more powerful the language and the "fancier" the code, the larger the maintenance and wrapping work.

Another example: a dot product of float/double 2d arrays is independent of any numpy version, and it will produce the same result in numpy 19.0 (except for different machine-precision rounding errors). A dot product of arrays without dtype and shape restrictions might be anything, and might change within a few numpy versions.

Josef
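A minimal sketch of that "coerce to float up front" pattern described above (an illustration, not from the original message; the function name is made up):

    import numpy as np

    def stable_dot(a, b):
        # Normalize inputs to plain float64 ndarrays at the top of the function,
        # so the rest of the code does not depend on exotic dtypes or subclasses.
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        return np.dot(a, b)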

Hi, On Sun, Jan 4, 2015 at 4:22 PM, Konrad Hinsen <konrad.hinsen@fastmail.net> wrote:
2) Changes that yield different results for unmodified legacy code should never be allowed.
I think this is a very reasonable rule. Case in point - I have some fairly old code in https://github.com/matthew-brett/transforms3d . I haven't updated this code since 2011. Now when I test it, I get the following error:

======================================================================
ERROR: Failure: ValueError (assignment destination is read-only)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mb312/.virtualenvs/scipy-devel/lib/python2.7/site-packages/nose/loader.py", line 251, in generate
    for test in g():
  File "/Users/mb312/dev_trees/transforms3d/transforms3d/tests/test_affines.py", line 74, in test_rand_de_compose
    T, R, Z, S = func(M)
  File "/Users/mb312/dev_trees/transforms3d/transforms3d/affines.py", line 298, in decompose
    Z[0] *= -1
ValueError: assignment destination is read-only

If I had waited until 1.10 (or whatever) I would have had to hope that my tests were good enough to pick this up, otherwise anyone using this code would be subject to some very strange bugs.

Cheers,
Matthew

Hello everyone,

I just wanted to highlight the point made by Charles; it would be great if he would clarify any mistakes in the points that I put forward. Quoting the documentation:

In versions of NumPy prior to 1.7, this function always returned a new, independent array containing a copy of the values in the diagonal. In NumPy 1.7 and 1.8, it continues to return a copy of the diagonal, but depending on this fact is deprecated. Writing to the resulting array continues to work as it used to, but a FutureWarning is issued. In NumPy 1.9 it returns a read-only view on the original array. Attempting to write to the resulting array will produce an error. In NumPy 1.10, it will return a read/write view; writing to the returned array will alter your original array.

Though the expected behaviour has its pros and cons, the points put forward are:

1. Revert the changes so that *PyArray_Diagonal* returns a *copy*.
2. Introduce a new API function *PyArray_Diagonal2*, which has a *copy* argument, so that either a copy or a view can be returned.
3. If a *view* is to be returned, its *writeability* depends on whether the *input* is writeable.
4. Implement *PyArray_Diagonal* in terms of the new function, though the default value of *copy* is undecided.
5. Raise a *FutureWarning* when trying to write to the result.
6. Add a *copy* argument to the *diagonal* function and method, updating the function in *methods.c* and *fromnumeric.py*, and probably in other places as well.
7. Also update the release notes and documentation.

I would love to do the PR once a decision is reached.

Cheers,
N.Maniteja
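A rough Python-level emulation of the proposed copy behaviour (a sketch only; the helper name is made up and this is not how the C-API change itself would be implemented):

    import numpy as np

    def diagonal_with_copy(a, copy=True):
        # Emulate the proposed diagonal(copy=...) interface on top of current NumPy:
        # copy=True returns an independent, writeable copy (pre-1.9 semantics);
        # copy=False returns whatever the installed NumPy's diagonal() provides.
        d = np.diagonal(a)
        return d.copy() if copy else d

    x = np.eye(2)
    d = diagonal_with_copy(x)   # safe to modify; x is unchanged
    d[0] = 2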

Thank you Charles for the corrections.

Cheers,
N.Maniteja

On Thu, Jan 15, 2015 at 10:41 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:

On 04/01/15 21:55, Sturla Molden wrote:
I am not asking for "big-bang transitions" as such. I am asking for breaking changes to go along with a clearly visible and clearly announced change in the API name and/or major version. A change as important as dropping support for an API that has been around for 20 years shouldn't happen as one point in the change list from version 1.8 to 1.9. It can happen in the transition from "numpy" to "numpy2", which ideally should be done in a way that permits users to install both "numpy" and "numpy2" in parallel to ease the transition. There is a tacit convention in computing that "higher" version numbers of a package indicate improvements and extensions but not reduction in functionality. This convention also underlies most of today's package management systems. Major breaking changes violate this tacit convention.
The question of reproducible research is orthogonal to this, I think.
Indeed. My blog post addresses two distinct issues, whose common point is that they relate to the evolution of NumPy.

Konrad.

Hi, On Thu, Jan 1, 2015 at 9:35 PM, Alexander Belopolsky <ndarray@mac.com> wrote:
I think this point is a good one, from Konrad Hinsen's blog post:

<quote>
If you get a Python script, say as a reviewer for a submitted article, and see “import numpy”, you don’t know which version of numpy the authors had in mind. If that script calls array.diag() and modifies the return value, does it expect to modify a copy or a view? The result is very different, but there is no way to tell. It is possible, even quite probable, that the code would execute fine with both NumPy 1.8 and the upcoming NumPy 1.10, but yield different results.
</quote>

That rules out the current 1.10 plan I think. copy=True as default seems like a nice compact and explicit solution to me.

Cheers,
Matthew

On Sat, Jan 3, 2015 at 2:55 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Bear in mind that this also affects the C-API via the PyArray_Diagonal function, so the rollback proposal would be:

1) Roll back the change to PyArray_Diagonal.
2) Introduce a new C-API function PyArray_Diagonal2 that has a 'copy' argument.
3) Make PyArray_Diagonal call PyArray_Diagonal2 with 'copy=1'.
4) Add a copy argument to the diagonal method.

I'm thinking we should have a rule that functions in the C-API can be refactored or deprecated, but they don't change otherwise.

Chuck

On Sat, Jan 3, 2015 at 11:08 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
I think maybe making the change in 1.10 is too quick, but it doesn't rule it out long-term. This issue and the copy=True alternative were extensively discussed when making the change:

http://thread.gmane.org/gmane.comp.python.numeric.general/49887/focus=49888

It's not impossible that we made the wrong decision a while back, but rehashing that whole discussion based on an argument that was already brought up back then doesn't sound good to me.
Makes sense. It's time to document the policy on deprecations and incompatible changes in more detail, I think. We had a statement a few sentences long on this on the Trac wiki, IIRC written by Robert Kern, but that's gone now. Do we have anything else written down anywhere?

Ralf
participants (11)
- Alan G Isaac
- Alexander Belopolsky
- Benjamin Root
- Charles R Harris
- josef.pktd@gmail.com
- Konrad Hinsen
- Maniteja Nandana
- Matthew Brett
- Nathaniel Smith
- Ralf Gommers
- Sturla Molden