Bug in numpy.mean() revisited
There was a thread in January discussing the non-obvious behavior of numpy.mean() for large arrays of float32 values [1]. This issue is nicely discussed at the end of the numpy.mean() documentation [2] with an example:
>>> a = np.zeros((2, 512*512), dtype=np.float32)
>>> a[0, :] = 1.0
>>> a[1, :] = 0.1
>>> np.mean(a)
0.546875
From the docs and previous discussion it seems there is no technical difficulty in choosing a different (higher precision) type for the accumulator using the dtype arg, and in fact this is done automatically for int values.
My question is whether there would be any support for doing something more than documenting this behavior. I suspect very few people ever make it below the fold of the np.mean() documentation. Taking the mean of large arrays of float32 values is a *very* common use case, and giving the wrong answer with default inputs is really disturbing. I recently had to rebuild a complex science data archive because of corrupted mean values.

Possible ideas to stimulate discussion:

1. Always use float64 to accumulate float types that are 64 bits or less. Are there serious performance impacts to automatically using float64 to accumulate float32 arrays? I appreciate this would likely introduce unwanted regressions (sometimes suddenly getting the right answer is a bad thing), so could this be considered for numpy 2.0?

2. Might there be a way to emit a warning if the number of values and the max accumulated value [3] are such that the estimated fractional error is above some tolerance? I'm not even sure if this is a good idea, or if there will be howls from the community as their codes start warning about inaccurate mean values. Better ideas along this line?

Cheers,
Tom

[1]: http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059960.html
[2]: http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
[3]: Using the max accumulated value during accumulation instead of the final accumulated value seems like the right thing for estimating precision loss, but this would affect performance, so maybe just using the final value would catch many cases.
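[Editor's note: a minimal sketch, not from the original thread, of the workaround already available via the dtype argument mentioned above; the printed value may vary slightly by platform.]

>>> import numpy as np
>>> a = np.zeros((2, 512*512), dtype=np.float32)
>>> a[0, :] = 1.0
>>> a[1, :] = 0.1
>>> np.mean(a, dtype=np.float64)    # accumulate in float64 instead of float32
0.55000000074505806

The small residual offset from 0.55 comes from representing 0.1 in float32, not from the accumulation itself.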
On Thu, Jul 26, 2012 at 9:26 PM, Tom Aldcroft wrote:
> 1. Always use float64 to accumulate float types that are 64 bits or
> less. Are there serious performance impacts to automatically using
> float64 to accumulate float32 arrays?

I would support accumulating in 64 bits but, IIRC, the function will need
to be rewritten so that it works by adding 32 bit floats to the accumulator
to save space. There are also more stable methods that could also be
investigated. There is a nice little project there for someone to cut their
teeth on.
Chuck
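[Editor's note: one family of "more stable methods" is compensated (Kahan) summation. The sketch below, with a hypothetical helper kahan_mean, is purely illustrative and is not NumPy's implementation; a pure-Python loop like this is far too slow for real arrays. It keeps everything in the array's own float32 precision and still recovers roughly the right mean.]

import numpy as np

def kahan_mean(x):
    # Compensated (Kahan) summation: carry a correction term holding the
    # low-order bits that each addition would otherwise discard.
    total = x.dtype.type(0.0)
    comp = x.dtype.type(0.0)
    for value in x.ravel():
        y = value - comp
        t = total + y           # low-order bits of y are lost here...
        comp = (t - total) - y  # ...and recovered into comp for the next step
        total = t
    return total / x.dtype.type(x.size)

a = np.zeros((2, 512 * 512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
print(kahan_mean(a))   # close to 0.55 even though everything stays float32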
On Thu, 2012-07-26 at 22:15 -0600, Charles R Harris wrote:
> I would support accumulating in 64 bits but, IIRC, the function will
> need to be rewritten so that it works by adding 32 bit floats to the
> accumulator to save space. There are also more stable methods that
> could also be investigated. There is a nice little project there for
> someone to cut their teeth on.
So a (very) quick read around suggests that using an interim mean gives a more robust algorithm. The problem is that these techniques are either multi-pass or inherently slower (due to, say, a division in the loop). Higher precision would not suffer the same potential slowdown and would solve most cases of this problem.

Henry
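[Editor's note: for concreteness, a rough sketch of the interim-mean idea Henry describes, using a hypothetical helper running_mean; it is plain Python and purely illustrative, not NumPy code. The running mean stays O(1) in magnitude instead of growing with the number of elements, at the cost of a division per element.]

import numpy as np

def running_mean(x):
    # Single-pass "interim mean" update: fold each element into the current
    # mean directly, paying for one division per element.
    mean = x.dtype.type(0.0)
    for i, value in enumerate(x.ravel(), start=1):
        mean = mean + (value - mean) / x.dtype.type(i)
    return mean

a = np.zeros((2, 512 * 512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
print(running_mean(a))   # compare with the naive float32 result, 0.546875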
On Fri, Jul 27, 2012 at 5:15 AM, Charles R Harris wrote:
> I would support accumulating in 64 bits but, IIRC, the function will
> need to be rewritten so that it works by adding 32 bit floats to the
> accumulator to save space. There are also more stable methods that
> could also be investigated. There is a nice little project there for
> someone to cut their teeth on.
So the obvious solution here would be to make the ufunc reduce loop smart enough that

    x = np.zeros(2 ** 30, dtype=np.float32)
    np.sum(x, dtype=np.float64)

does not upcast 'x' to float64 as a whole. This shouldn't be too terrible to implement -- iterate over the float32 array and only upcast each inner-loop "buffer" as you go, instead of upcasting the whole thing. In fact, nditer might do this already?

Then using a wide accumulator by default would just take a few lines of code in numpy.core._methods._mean to select the proper dtype and downcast the result.

-n
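[Editor's note: to make the buffered-upcast idea concrete, here is a rough, hypothetical sketch (a made-up helper chunked_mean, not actual NumPy internals) of summing one small chunk at a time into a float64 accumulator, so only a chunk-sized buffer is ever upcast, then downcasting the result as _mean might.]

import numpy as np

def chunked_mean(x, chunk=8192):
    # Buffered wide-accumulator mean: upcast one small chunk at a time to
    # float64 instead of converting the whole float32 array.
    flat = x.ravel()
    total = np.float64(0.0)
    for start in range(0, flat.size, chunk):
        # only this slice is accumulated in float64
        total += flat[start:start + chunk].sum(dtype=np.float64)
    # downcast the result back to the input precision
    return x.dtype.type(total / flat.size)

a = np.zeros((2, 512 * 512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
print(chunked_mean(a))   # ~0.55, while the accumulation stayed in float64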
Participants (4):
- Charles R Harris
- Henry Gomersall
- Nathaniel Smith
- Tom Aldcroft