numpy.mean still broken for large float32 arrays

Hi!
The following has been a known "bug" since at least 2010 [1]:
import numpy as np
X = np.ones((50000, 1024), np.float32)
print X.mean()
>>> 0.32768
I ran into this for the first time today as part of a larger program. I was very surprised by this, and spent over an hour looking for bugs in my code before noticing that the culprit was `mean` being broken for large float32 arrays. I realize that this behavior is actually documented, but it is absolutely non-intuitive. I assume most users expect `mean` to just work.
This has been discussed once two years ago [2], but nothing came of that. This could be easily fixed by making `np.float64` the default dtype (as it already is for integer types), or by at least checking inside `mean` whether the passed array is a large np.float32 array and switching the dtype to np.float64 in that case. Is there a reason why this has not been done?
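To make that concrete, here is a rough sketch of the kind of check I have in mind (`safe_mean` and the threshold are made up for illustration, not an existing numpy API):

import numpy as np

def safe_mean(a, axis=None, dtype=None, threshold=10**6):
    # Hypothetical wrapper: upcast the accumulator for large float32 inputs
    # unless the caller explicitly chose a dtype.
    if dtype is None and a.dtype == np.float32 and a.size > threshold:
        dtype = np.float64
    return a.mean(axis=axis, dtype=dtype)

X = np.ones((50000, 1024), np.float32)
print(safe_mean(X))  # 1.0 instead of 0.32768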
Cheers
Thomas
[1] http://mail.scipy.org/pipermail/numpy-discussion/2010-November/053697.html
[2] http://numpy-discussion.10968.n7.nabble.com/Bug-in-numpy-mean-revisited-td12...

Arguably, this isn't a problem of numpy, but of programmers being trained to think of floating point numbers as 'real' numbers, rather than just a finite number of states with a funny distribution over the number line. np.mean isn't broken; your understanding of floating point numbers is.
What you appear to wish for is a silent upcasting of the accumulated result. This is often performed in reducing operations, but I can imagine it runs into trouble for nd-arrays. After all, if I have a huge array that I want to reduce over a very short axis, upcasting might be very undesirable; it wouldn't buy me any extra precision, but it would increase memory use from 'huge' to 'even more huge'.
np.mean has a kwarg that allows you to explicitly choose the dtype of the accumulator: X.mean(dtype=np.float64) == 1.0. Personally, I have a distaste for implicit behavior, unless the rule is simple and there really can be no downsides, which I would argue doesn't apply here. Perhaps when reducing an array completely to a single value there is no harm in upcasting to the maximum machine precision; but that becomes a rather complex rule which would work out differently on different machines. It's better to be confronted with the limitations of floating point numbers earlier, rather than later, when you want to distribute your work and run into subtle bugs on other people's computers.
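Spelled out on the example that started the thread (output values as reported above):

import numpy as np

X = np.ones((50000, 1024), np.float32)
print(X.mean())                  # 0.32768, as reported above
print(X.mean(dtype=np.float64))  # 1.0, with an explicit float64 accumulator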

I don't agree. The problem is that I expect `mean` to do something reasonable. The documentation mentions that the results can be "inaccurate", which is a huge understatement: the results can be utterly wrong. That is not reasonable. At the very least, a warning should be issued in cases where the dtype might not be appropriate.
One cannot predict what input sizes a program will be run with once it's in use (especially if it's in use for several years). I'd argue this is true for pretty much all code except quick one-off scripts. Thus one would have to use `dtype=np.float64` everywhere, by which point it seems obvious that it should have been the default in the first place. The other alternative would be to extend np.mean with some logic that internally figures out the right thing to do (which I don't think is too hard).
Your example with the short axis is something that can be checked for. I agree that the logic could become a bit hairy, but not too much: if we are going to sum up more than N values (where N could be determined at compile time, or simply be some constant), we upcast unless the user explicitly specified a dtype. Of course, this would incur an increase in memory. However, I'd argue that it's not even a large increase: if you can fit the matrix in memory, then allocating a row/column of float64 instead of float32 should be doable as well. And I'd much rather get an OutOfMemory exception than silently continue my calculations with useless/wrong results.
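A rough sketch of that rule (purely illustrative; the function name and constant are placeholders, not numpy API). The decision depends on how many values are summed into each output element, not on the total array size, so a reduction over a short axis is left alone:

import numpy as np

UPCAST_THRESHOLD = 8192  # placeholder for the constant N discussed above

def accumulator_dtype(a, axis=None):
    # Hypothetical heuristic: upcast float32 accumulation to float64 only
    # when many values are summed into each output element.
    if a.dtype != np.float32:
        return None  # let numpy pick its default accumulator
    n_reduced = a.size if axis is None else a.shape[axis]
    return np.float64 if n_reduced > UPCAST_THRESHOLD else None

X = np.ones((50000, 1024), np.float32)
row_means = X.mean(axis=1, dtype=accumulator_dtype(X, axis=1))  # 1024 values per sum: no upcast
total_mean = X.mean(dtype=accumulator_dtype(X))                 # 51.2 million values: upcast
print(total_mean)  # 1.0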
Cheers
Thomas
On 2014-07-24 11:59, Eelco Hoogendoorn wrote:
Arguably, this isn't a problem of numpy, but of programmers being trained to think of floating point numbers as 'real' numbers, rather than just a finite number of states with a funny distribution over the number line. np.mean isn't broken; your understanding of floating point numbers is.
What you appear to wish for is a silent upcasting of the accumulated result. This is often performed in reducing operations, but I can imagine it runs into trouble for nd-arrays. After all, if I have a huge array that I want to reduce over a very short axis, upcasting might be very undesirable; it wouldn't buy me any extra precision, but it would increase memory use from 'huge' to 'even more huge'.
np.mean has a kwarg that allows you to explicitly choose the dtype of the accumulator: X.mean(dtype=np.float64) == 1.0. Personally, I have a distaste for implicit behavior, unless the rule is simple and there really can be no downsides, which I would argue doesn't apply here. Perhaps when reducing an array completely to a single value there is no harm in upcasting to the maximum machine precision; but that becomes a rather complex rule which would work out differently on different machines. It's better to be confronted with the limitations of floating point numbers earlier, rather than later, when you want to distribute your work and run into subtle bugs on other people's computers.

On 24/07/2014 12:55, Thomas Unterthiner wrote:
I don't agree. The problem is that I expect `mean` to do something reasonable. The documentation mentions that the results can be "inaccurate", which is a huge understatement: the results can be utterly wrong. That is not reasonable. At the very least, a warning should be issued in cases where the dtype might not be appropriate.
Maybe the problem is the documentation, then. If this is a common error, it could be explicitly documented in the function documentation.

Hi all,
On 24.07.2014 11:59, Eelco Hoogendoorn wrote:
np.mean isn't broken; your understanding of floating point numbers is.
I am quite new to python, and this problem is discussed over and over for other languages too. However, numpy's summation problem appears with relatively small arrays already:
py> import numpy as np
py> np.ones((4000,4000), np.float32).mean()
1.0
py> np.ones((5000,5000), np.float32).mean()
0.67108864000000001
A 5000*5000 image is not unusual anymore today.
In IDL:
IDL> mean(fltarr(5000L, 5000L)+1)
1.0000000
IDL> mean(fltarr(7000L, 7000L)+1)
1.0000000
IDL> mean(fltarr(10000L, 10000L)+1)
0.67108864
I can't really explain why there are differences between the two languages (IDL uses 32-bit, single-precision floating-point numbers).
Fabien

On Thu, Jul 24, 2014 at 1:33 PM, Fabien <fabien.maussion@gmail.com> wrote:
Hi all,
On 24.07.2014 11:59, Eelco Hoogendoorn wrote:
np.mean isn't broken; your understanding of floating point numbers is.
I am quite new to python, and this problem is discussed over and over for other languages too. However, numpy's summation problem appears with relatively small arrays already:
py> import numpy as np
py> np.ones((4000,4000), np.float32).mean()
1.0
py> np.ones((5000,5000), np.float32).mean()
0.67108864000000001
A 5000*5000 image is not unusual anymore today.
In IDL:
IDL> mean(fltarr(5000L, 5000L)+1)
1.0000000
IDL> mean(fltarr(7000L, 7000L)+1)
1.0000000
IDL> mean(fltarr(10000L, 10000L)+1)
0.67108864
I can't really explain why there are differences between the two languages (IDL uses 32-bit, single-precision floating-point numbers).
Fabien
Something as simple as summation is already an interesting algorithmic problem: there are several ways to do it, with different speeds and accuracies. E.g. Python's math.fsum is always exact to one ulp, but it is very slow as it requires partial sorting. Then there is Kahan summation, which has an accuracy of O(1) ulp but is about 4 times slower than the naive sum. In practice one of the better methods is pairwise summation, which is pretty much as fast as naive summation but has an accuracy of O(log N) ulp. This is the method numpy 1.9 will use by default (and it's even a bit faster than our old implementation of the naive sum): https://github.com/numpy/numpy/pull/3685
It has some limitations, though: it is limited to blocks of the buffer size (8192 elements by default) and does not work along the slow axes, due to limitations in the numpy iterator.
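In case the idea is unfamiliar, here is a toy pure-Python illustration of pairwise summation; numpy's actual implementation is in C and works on blocks of the buffer size, so this only shows the principle:

import numpy as np

def naive_sum(x):
    # Left-to-right summation in the input's own precision.
    s = x.dtype.type(0)
    for v in x:
        s += v
    return s

def pairwise_sum(x, block=128):
    # Recursively sum the two halves; the rounding error grows as
    # O(log N) instead of O(N) for the naive loop above.
    if len(x) <= block:
        return naive_sum(x)
    mid = len(x) // 2
    return pairwise_sum(x[:mid], block) + pairwise_sum(x[mid:], block)

# float16 and a small array make the effect visible quickly; the same thing
# happens with a float32 accumulator once the running sum passes 2**24.
x = np.ones(10000, np.float16)
print(naive_sum(x) / len(x))     # ~0.2: the running sum gets stuck at 2048
print(pairwise_sum(x) / len(x))  # 1.0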

On Thu, Jul 24, 2014 at 4:56 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
In practice one of the better methods is pairwise summation, which is pretty much as fast as naive summation but has an accuracy of O(log N) ulp. This is the method numpy 1.9 will use by default (and it's even a bit faster than our old implementation of the naive sum): https://github.com/numpy/numpy/pull/3685
It has some limitations, though: it is limited to blocks of the buffer size (8192 elements by default) and does not work along the slow axes, due to limitations in the numpy iterator.
For what it's worth, I see the issue on a 64-bit Windows numpy 1.8, but cannot reproduce it on a 32-bit Windows numpy master:
>>> np.__version__
'1.8.0'
>>> np.ones(100000000, dtype=np.float32).mean()
0.16777216

>>> np.__version__
'1.10.0.dev-Unknown'
>>> np.ones(100000000, dtype=np.float32).mean()
1.0

On Thu, Jul 24, 2014 at 8:27 AM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
On Thu, Jul 24, 2014 at 4:56 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
In practice one of the better methods is pairwise summation, which is pretty much as fast as naive summation but has an accuracy of O(log N) ulp. This is the method numpy 1.9 will use by default (and it's even a bit faster than our old implementation of the naive sum): https://github.com/numpy/numpy/pull/3685
It has some limitations, though: it is limited to blocks of the buffer size (8192 elements by default) and does not work along the slow axes, due to limitations in the numpy iterator.
For what it's worth, I see the issue on a 64-bit Windows numpy 1.8, but cannot reproduce it on a 32-bit Windows numpy master:
>>> np.__version__
'1.8.0'
>>> np.ones(100000000, dtype=np.float32).mean()
0.16777216

>>> np.__version__
'1.10.0.dev-Unknown'
>>> np.ones(100000000, dtype=np.float32).mean()
1.0
Interesting. Might be compiler related as there are many choices for floating point instructions/registers in i386. The i386 version may effectively be working in double precision.
Chuck

On Thu, Jul 24, 2014 at 12:59 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Thu, Jul 24, 2014 at 8:27 AM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
On Thu, Jul 24, 2014 at 4:56 AM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
In practice one of the better methods is pairwise summation, which is pretty much as fast as naive summation but has an accuracy of O(log N) ulp. This is the method numpy 1.9 will use by default (and it's even a bit faster than our old implementation of the naive sum): https://github.com/numpy/numpy/pull/3685
It has some limitations, though: it is limited to blocks of the buffer size (8192 elements by default) and does not work along the slow axes, due to limitations in the numpy iterator.
For what it's worth, I see the issue on a 64-bit Windows numpy 1.8, but cannot reproduce it on a 32-bit Windows numpy master:
>>> np.__version__
'1.8.0'
>>> np.ones(100000000, dtype=np.float32).mean()
0.16777216

>>> np.__version__
'1.10.0.dev-Unknown'
>>> np.ones(100000000, dtype=np.float32).mean()
1.0
Interesting. Might be compiler related as there are many choices for floating point instructions/registers in i386. The i386 version may effectively be working in double precision.
Also note the different numpy versions: Julian said that numpy 1.9 will use a more precise method in that case. That could explain it.
Fred

On 7/24/2014 5:59 AM, Eelco Hoogendoorn wrote to Thomas:
np.mean isn't broken; your understanding of floating point numbers is.
This comment seems to conflate separate issues: the desirable return type, and the computational algorithm. It is certainly possible to compute a mean of float32 doing reduction in float64 and still return a float32. There is nothing implicit in the name `mean` that says we have to just add everything up and divide by the count.
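For example (just to spell the distinction out):

import numpy as np

X = np.ones((50000, 1024), np.float32)
m = X.mean(dtype=np.float64).astype(np.float32)  # accumulate in float64, return float32
# m == 1.0 and m.dtype is float32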
My own view is that `mean` computed as a running mean would behave enough better to justify the speed loss. Naturally, similar issues arise for `var` and `std`, etc. See http://www.johndcook.com/standard_deviation.html for some discussion and references.
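A minimal sketch of the running-mean update, purely illustrative (and in pure Python, so far slower than a vectorized sum):

import numpy as np

def running_mean(x):
    # Incremental update m_k = m_{k-1} + (x_k - m_{k-1}) / k, with the
    # accumulator kept in float32. The accumulator stays on the scale of
    # the data instead of growing with the number of elements, so float32
    # precision is not exhausted the way it is in a plain running sum.
    m = np.float32(0.0)
    for k, v in enumerate(x, 1):
        m += (v - m) / np.float32(k)
    return m

x = np.ones(10**5, np.float32)
print(running_mean(x))  # 1.0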
Alan Isaac
participants (9)
- Alan G Isaac
- Charles R Harris
- Eelco Hoogendoorn
- Fabien
- Frédéric Bastien
- Jaime Fernández del Río
- Joseph Martinot-Lagarde
- Julian Taylor
- Thomas Unterthiner