
Okay! I think I now actually understand what was going on with np.gradient! The discussion has been pretty confused, and I'm worried that what's in master right now is not the right solution, which is a problem because we're supposed to cut 1.9.1 tomorrow.

Background:

np.gradient computes gradients using a finite difference method. Wikipedia has a good explanation of how this works: https://en.wikipedia.org/wiki/Numerical_differentiation The key point is that there are multiple formulas one can use that all converge to the derivative, e.g.

(f(x + h) - f(x)) / h

or

(f(x + h) - f(x - h)) / 2h

The first is the textbook definition of the derivative. As h -> 0, the error in that approximation shrinks like O(h). The second formula also converges to the derivative as h -> 0, but it converges like O(h^2), i.e., much faster. And there are many, many formulas like this with different trade-offs: https://en.wikipedia.org/wiki/Finite_difference_coefficient In practice, given a grid of values of f(x) and a grid stepsize of h, all of these formulas come down to doing a simple convolution of the data with certain coefficients (e.g. [-1/h, 1/h] or [-1/2h, 0, 1/2h] for the two formulas above).

Traditionally np.gradient has used the second formula above, with its quadratically diminishing error term, for interior points in the grid. For points on the boundary of the grid, though, this formula has a problem, b/c it requires you to look at points to both the left and the right of the current point -- but for boundary points one of these doesn't exist. In such situations np.gradient has traditionally used the first formula instead (the "forward (or backward) finite difference approximation with linear error", where "forward"/"backward" means that it works on the boundary). As the web page linked above shows, though, there's an easy alternative formula that works on

The change in numpy 1.9.0 that has caused all the fuss is that we switched to using the quadratically accurate approximation instead of the linearly accurate approximation for boundary points. This had two effects:

1) it produced (hopefully small) changes in the calculation for the boundaries. In general these should have made the calculations more accurate, but apparently this did cause problems for some people who wanted perfectly reproducible output.

2) it tickled a nasty bug in how gradient and masked arrays work together. It turns out gradient applied to masked arrays has always been seriously buggy; there's some horribleness in the masked array implementation that means the output array that np.gradient is writing into ends up sharing a mask array with the input. So as we go along calculating the output, we also mutate the input. This can definitely corrupt the input, for example here: https://gist.github.com/njsmith/551738469b74d175e039 ...and I strongly suspect it's possible to find cases where it means you get the wrong answer.

For mysterious reasons involving the order in which values were calculated or something, the version of np.gradient in 1.9.0 tickled this bug in a much worse manner, with early mutations to the input array ending up affecting later calculations using the same input array, causing cascading errors.
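A minimal way to probe for the shared-mask problem described above (a diagnostic sketch, not part of the original report; whether the masks actually come out shared, and whether the input mask gets mutated, depends on which numpy version is installed):

    import numpy as np

    x = np.ma.masked_array(np.linspace(0.0, 1.0, 8) ** 2,
                           mask=[0, 0, 1, 0, 0, 0, 1, 0])
    mask_before = x.mask.copy()

    g = np.gradient(x)  # 1-D input, so a single array comes back

    print("output shares mask memory with input: %s"
          % np.may_share_memory(np.ma.getmaskarray(g), np.ma.getmaskarray(x)))
    print("input mask mutated by the call: %s"
          % (not np.array_equal(mask_before, np.ma.getmaskarray(x))))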
This is the cause of the massive problems that matplotlib encountered in their test suite: https://github.com/matplotlib/matplotlib/issues/3598#issuecomment-58859663

There's a proposed fix for the masked array issues at: https://github.com/numpy/numpy/pull/5203

For the reasons described above, I would be very wary about using *any* currently released version of np.gradient on masked arrays. A workaround would be to make a local copy of the definition of np.gradient, and immediately after the 'out' array is allocated do 'out.mask = out.mask.copy()'.

Once this bug is fixed, the "new" gradient code produces results which are identical to the "old" gradient code on the interior; only the boundary is different.

------

So here are my concerns:

- We decided to revert the changes to np.gradient in 1.9.1 (at least by default). I'm not sure how much of that decision was based on the issues with masked arrays, though, which turn out to be a different issue entirely. The previous discussion conflated the two issues above.

- We decided to gate the more-accurate boundary calculation behind a kwarg called "edge_order=2". AFAICT, now that I actually understand what this code is doing, this is a terrible name -- we're using a 3-coefficient kernel to compute a quadratically accurate approximation to a first-order derivative. There probably exist other kernels that are also quadratically accurate. "Order 2" simply doesn't refer to this in any unique or meaningful way. And it will be even more confusing if we ever add the option to select which kernel to use on the interior, where "quadratically accurate" is definitely not enough information to uniquely define a kernel. Maybe we want something like edge_accuracy=2 or edge_accuracy="quadratic" or something? I'm not sure.

- If edge_order=2 escapes into a release tomorrow then we'll be stuck with it.

Some possible options for 1.9.1:

- Just revert the entire np.gradient code to match 1.8, and put off a real fix until 1.9. This at least avoids getting stuck with a poor API.

- Put the masked array bug fix into 1.9.1, and leave the gradient code the same as 1.9.0; put off designing a proper API for selecting the old behavior until 1.9.2. (Or 1.10, I guess.) This doesn't solve the pure reproducibility problems, but it might be enough to make matplotlib work again, at least?

- Delay 1.9.1 to rehash the issues and make a decision on what we want to support long-term.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
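To see concretely that the two edge treatments only disagree at the endpoints, here is a short sketch. It assumes a numpy new enough to accept the edge_order keyword (the very kwarg under discussion); on such a version the interior values should come out identical, and only the first and last entries differ:

    import numpy as np

    x = np.linspace(0.0, 2.0, 21)
    h = x[1] - x[0]
    y = np.exp(x)

    g1 = np.gradient(y, h, edge_order=1)  # old-style linear edge formula
    g2 = np.gradient(y, h, edge_order=2)  # new-style quadratic edge formula

    print("interior identical: %s" % np.array_equal(g1[1:-1], g2[1:-1]))
    print("endpoint differences: %s" % (g1[[0, -1]] - g2[[0, -1]]))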

Hi,

On Sat, Oct 18, 2014 at 6:17 PM, Nathaniel Smith <njs@pobox.com> wrote:
Traditionally np.gradient has used the second formula above, with its quadratically diminishing error term, for interior points in the grid. For points on the boundary of the grid, though, this formula has a problem, b/c it requires you to look at points to both the left and the right of the current point -- but for boundary points one of these doesn't exist. In such situations np.gradient has traditionally used the first formula instead (the "forward (or backward) finite difference approximation with linear error", where "forward"/"backward" means that it works on the boundary). As the web page linked above shows, though, there's an easy alternative formula that works on
Did you lose some text here?
Excellent summary, thanks for doing all the work to get to the bottom of the problem.

Is there any significant disadvantage to delaying the 1.9.1 release for a few days? I would appreciate a short delay to see if it's possible to agree a workaround for the OSX Accelerate bug: https://github.com/numpy/numpy/pull/5205 - but in any case, unless there is a compelling reason to release tomorrow, it seems reasonable to take the time to agree on a good solution to the np.gradient API.

Cheers,

Matthew

On Sun, Oct 19, 2014 at 2:23 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
Did you lose some text here?
"There's an easy alternative formula that works on edge points and provides quadratic accuracy." Not too critical, you probably figured out the gist of it :-) -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sat, Oct 18, 2014 at 7:17 PM, Nathaniel Smith <njs@pobox.com> wrote:
- We decided to revert the changes to np.gradient in 1.9.1 (at least by default). I'm not sure how much of that decision was based on the issues with masked arrays, though, which turns out to be a different issue entirely. The previous discussion conflated the two issues above.
My concern was reproducibility. The old behavior wasn't a bug, so we should be careful that old results are reproducible.
- We decided to gate the more-accurate boundary calculation behind a kwarg called "edge_order=2". AFAICT now that I actually understand what this code is doing, this is a terrible name -- we're using a 3-coefficient kernel to compute a quadratically accurate approximation to a first-order derivative.
Accuracy is a different problem and depends on the function being interpolated. As you point out above, the order refers to the *rate* of convergence. Normally that is illustrated with a loglog plot of the absolute value of the error against h, resulting in a straight line once the function is sufficiently smooth over the range of h. The end points are a bit special because of the lack of bracketing. Relative to the interior, one is effectively extrapolating rather than interpolating and things can become a bit unstable. Hence it is useful to have a safe option, here linear extrapolation, as an alternative to a higher order method.
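A sketch of the kind of check described here, estimating the empirical order from the slope of log(error) against log(h) at a boundary point (it assumes a numpy that accepts the edge_order keyword; exp is used simply as a smooth test function with a known derivative):

    import numpy as np

    def boundary_error(n, edge_order):
        x = np.linspace(0.0, 1.0, n)
        h = x[1] - x[0]
        g = np.gradient(np.exp(x), h, edge_order=edge_order)
        return h, abs(g[0] - np.exp(x[0]))  # error at the left edge

    for edge_order in (1, 2):
        hs, errs = zip(*[boundary_error(n, edge_order) for n in (11, 21, 41, 81, 161)])
        slope = np.polyfit(np.log(hs), np.log(errs), 1)[0]
        print("edge_order=%d: empirical order ~ %.2f" % (edge_order, slope))

The fitted slopes come out close to 1 and 2 respectively, i.e. the straight lines on the loglog plot have those slopes once the function is smooth enough over the range of h.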
There probably exist other kernels that are also quadratically accurate. "Order 2" simply doesn't refer to this in any unique or meaningful way. And it will be even more confusing if we ever add the option to select which kernel to use on the interior, where "quadratically accurate" is definitely not enough information to uniquely define a kernel. Maybe we want something like edge_accuracy=2 or edge_accuracy="quadratic" or something? I'm not sure.
- If edge_order=2 escapes into a release tomorrow then we'll be stuck with it.
Order has two common meanings in the numerical context, either degree, or, for polynomials, the number of coefficients (degree + 1). I've noticed that the meaning has evolved over the years, and these days an equivalence with degree seems to be pretty common. In the present context, it refers to the power of the h term in the error term of the Taylor's series approximation of the derivative. The precise meaning needs to be elucidated in the notes, as the order doesn't map one-to-one into methods.
Some possible options for 1.9.1: - Just revert the entire np.gradient code to match 1.8, and put off a real fix until 1.9. This at least avoids getting stuck with a poor API.
- Put the masked array bug fix into 1.9.1, and leave the gradient code the same as 1.9.0; put off designing a proper API for selecting the old behavior for 1.9.2. (Or 1.10 I guess.) This doesn't solve the pure reproducibility problems, but it might be enough to make matplotlib work again at least?
The only reason I see to keep the current default is to maintain behavior from 1.9.0 to 1.9.1. I don't think 1.9.0 will last long in the wild. Anyone working on the cutting edge will likely update almost immediately, and distro releases in the near future will probably have 1.8.2.
- Delay 1.9.1 to rehash the issues and make a decision on what we want to support long-term.
Might want to delay a few days in any case to settle the Mac/Windows issues.

Chuck

On Sun, Oct 19, 2014 at 3:37 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
My concern was reproducibility. The old behavior wasn't a bug, so we should be careful that old results are reproducible.
Yep.
Accuracy is a different problem and depends on the function being interpolated. As you point out above, the order refers to the *rate* of convergence. Normally that is illustrated with a loglog plot of the absolute value of the error against h, resulting in a straight line once the function is sufficiently smooth over the range of h. The end points are a bit special because of the lack of bracketing. Relative to the interior, one is effectively extrapolating rather than interpolating and things can become a bit unstable. Hence it is useful to have a safe option, here linear extrapolation, as an alternative to a higher order method.
This sounds plausible! But given that we have to (a) pick defaults, (b) pick which ones are included at all, and (c) document the difference in such a way that users can make an informed choice, then I'd kinda like... more precision :-). I'm sure there are situations where one or the other is better, but which situations are those? Do you know some way to tell which is which? Does one situation arise substantially more often than the other?
Order has two common meanings in the numerical context, either degree, or, for polynomials, the number of coefficients (degree + 1). I've noticed that the meaning has evolved over the years, and these days an equivalence with degree seems to be pretty common. In the present context, it refers to the power of the h term in the error term of the Taylor's series approximation of the derivative. The precise meaning needs to be elucidated in the notes, as the order doesn't map one-to-one into methods.
Surely this is still an argument for using a word that requires less elucidation, like degree or accuracy? (I'm particularly concerned because "2nd order derivative" has a *very* well known meaning that's very importantly different.)

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

On Sun, Oct 19, 2014 at 8:13 AM, Nathaniel Smith <njs@pobox.com> wrote:
This sounds plausible! But given that we have to (a) pick defaults, (b) pick which ones are included at all, and (c) document the difference in such a way that users can make an informed choice, then I'd kinda like... more precision :-). I'm sure there are situations where one or the other is better, but which situations are those? Do you know some way to tell which is which? Does one situation arise substantially more often than the other?
A challenge ;)

    In [34]: x = arange(11)/10.

    In [35]: y = exp(-1./x + -1./(1 - x))

    In [36]: y[-1] = y[0] = 0

    In [37]: plot(x, gradient(y, .01, edge_order=1))
    Out[37]: [<matplotlib.lines.Line2D at 0x4c319d0>]

    In [38]: plot(x, gradient(y, .01, edge_order=2))
    Out[38]: [<matplotlib.lines.Line2D at 0x4c49f10>]

A bit artificial, I'll admit. Most functions with reasonable sample points do better with the second order version. The main argument for keeping the linear default is backward compatibility. Absent that, second order would be preferable.
Surely this is still an argument for using a word that requires less elucidation, like degree or accuracy? (I'm particularly concerned because "2nd order derivative" has a *very* well known meaning that's very importantly different.)
Order is well understood in context, but you have a point for naive users. Maybe a string would be better, "edge_method='linear'", etc.

Chuck
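To make the string-keyword idea concrete, here is a rough sketch of what it might look like for the 1-D, uniform-spacing case. The name edge_method and its values are hypothetical (nothing like this exists in numpy); the kernels are just the standard central, one-sided linear, and one-sided quadratic differences discussed earlier in the thread:

    import numpy as np

    def gradient_1d(y, h, edge_method='linear'):
        # Hypothetical API sketch: 1-D gradient with a string-valued edge keyword.
        y = np.asarray(y, dtype=float)
        out = np.empty_like(y)
        # interior: central differences, quadratically accurate
        out[1:-1] = (y[2:] - y[:-2]) / (2 * h)
        if edge_method == 'linear':
            # one-sided first-order differences at the edges
            out[0] = (y[1] - y[0]) / h
            out[-1] = (y[-1] - y[-2]) / h
        elif edge_method == 'quadratic':
            # one-sided three-point second-order differences at the edges
            out[0] = (-3 * y[0] + 4 * y[1] - y[2]) / (2 * h)
            out[-1] = (3 * y[-1] - 4 * y[-2] + y[-3]) / (2 * h)
        else:
            raise ValueError("unknown edge_method: %r" % (edge_method,))
        return out

    x = np.linspace(0.0, 1.0, 11)
    print(gradient_1d(np.exp(x), x[1] - x[0], edge_method='quadratic')[:3])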
participants (3)
- Charles R Harris
- Matthew Brett
- Nathaniel Smith