Re: [Numpy-discussion] numpy.mean still broken for large float32 arrays
At 07:22 AM 7/25/2014, you wrote:
We were talking about this in the office, as we realized it does affect a couple of lines dealing with large arrays, including complex64. As I expect Python modules to work uniformly cross-platform unless documented otherwise, to me that includes 32- vs 64-bit platforms, implying that the modules should automatically use large enough accumulators for the input data type.
The 32/64bitness of your platform has nothing to do with floating point.
As a naive end user, I can, and do, download different binaries for different CPUs/Windows versions and will get different results: http://mail.scipy.org/pipermail/numpydiscussion/2014July/070747.html
Nothing discussed in this thread is platform-specific (modulo some minor details about the hardware FPU, but that should be taken as read).
And compilers, apparently.
The important point was that it would be best if all of the methods affected by summing 32 bit floats with 32 bit accumulators had the same Notes as numpy.mean(). We went through a lot of code yesterday, assuming that any numpy or Scipy.stats functions that use accumulators suffer the same issue, whether noted or not, and found it true.
"Depending on the input data, this can cause the results to be inaccurate, especially for float32 (see example below). Specifying a higher-precision accumulator using the <http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html#numpy.dtype>dtype keyword can alleviate this issue." seems rather un-Pythonic.
Ray
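For readers following along, here is a minimal sketch of the workaround that the quoted documentation note describes: passing a wider accumulator type through the dtype keyword of numpy.mean. The array size and fill value are arbitrary choices for illustration and do not come from the thread. How far the default float32 result drifts depends on the NumPy version and the memory layout of the array, so no particular value is claimed for it here.

```python
import numpy as np

# Illustrative data only: one million copies of the float32 value nearest 0.1.
data = np.full(1_000_000, 0.1, dtype=np.float32)

# Default behaviour: for float32 input the result (and the accumulator)
# is float32.
mean_default = np.mean(data)

# Workaround from the numpy.mean Notes: request a float64 accumulator.
mean_f64 = np.mean(data, dtype=np.float64)

# Reference value: the float32 constant itself, widened to a Python float.
reference = float(np.float32(0.1))

print("default :", mean_default, mean_default.dtype)
print("float64 :", mean_f64, mean_f64.dtype)
print("float64 error:", abs(mean_f64 - reference))
```

The same dtype keyword is accepted by numpy.sum, so the workaround can be applied wherever a reduction over float32 data is written out by hand.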
On Fri, Jul 25, 2014 at 5:56 PM, RayS <rays@bluecove.com> wrote:
The important point was that it would be best if all of the methods affected by summing 32 bit floats with 32 bit accumulators had the same Notes as numpy.mean(). We went through a lot of code yesterday, assuming that any numpy or Scipy.stats functions that use accumulators suffer the same issue, whether noted or not, and found it true.
Do you have a list of the functions that are affected?
"Depending on the input data, this can cause the results to be inaccurate, especially for float32 (see example below). Specifying a higher-precision accumulator using the dtype keyword can alleviate this issue." seems rather un-Pythonic.
It's true that in its full generality, this problem just isn't something numpy can solve. Using float32 is extremely dangerous and should not be attempted unless you're prepared to seriously analyze all your code for numeric stability; IME it often runs into problems in practice, in any number of ways. Remember that it only has as much precision as a 24-bit integer. There are good reasons why float64 is the default!
That said, it does seem that np.mean could be implemented better than it is, even given float32's inherent limitations. If anyone wants to implement better algorithms for computing the mean, variance, sums, etc., then we would love to add them to numpy. I'd suggest implementing them as gufuncs; there are examples of defining gufuncs in numpy/linalg/umath_linalg.c.src.
-n
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
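To make the two technical points in this reply concrete, the sketch below first shows the 24-bit limit of float32, then compares a plain running float32 sum with pairwise summation, one well-known example of a "better algorithm" for sums and means. This is only an illustration written in pure Python for readability. It is not numpy's implementation and not the gufunc approach suggested above. The function names, the block size of 128, and the test data are all hypothetical choices made for this example.

```python
import numpy as np

# float32 has a 24-bit significand, so 2**24 + 1 is not representable:
# adding 1 to 2**24 rounds straight back to 2**24.
saturated = np.float32(2**24) + np.float32(1)
print(saturated == np.float32(2**24))  # True


def naive_sum_f32(values):
    """Running sum that rounds to float32 after every addition."""
    acc = np.float32(0.0)
    for v in values:
        acc = np.float32(acc + v)
    return acc


def pairwise_sum_f32(values, block=128):
    """Pairwise (divide-and-conquer) sum, still using float32 only.

    Short runs are summed naively; longer runs are split in half and
    the two partial sums are added, which keeps the operands of each
    addition at a similar magnitude.
    """
    n = len(values)
    if n <= block:
        return naive_sum_f32(values)
    mid = n // 2
    left = pairwise_sum_f32(values[:mid], block)
    right = pairwise_sum_f32(values[mid:], block)
    return np.float32(left + right)


# Illustrative data: one million copies of the float32 value nearest 0.1.
data = np.full(1_000_000, 0.1, dtype=np.float32)
exact = len(data) * float(np.float32(0.1))  # reference computed in float64

naive = naive_sum_f32(data)
pairwise = pairwise_sum_f32(data)

print("naive error   :", abs(float(naive) - exact))
print("pairwise error:", abs(float(pairwise) - exact))
```

On this data the running sum drifts because each small addend is rounded against an ever larger accumulator, while the pairwise version stays close to the reference without needing a wider accumulator type. Dividing either sum by len(data) gives the corresponding mean.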
participants (2)

Nathaniel Smith

RayS