Re: [Numpydiscussion] numpy.mean still broken for largefloat32arrays
At 11:29 AM 7/25/2014, you wrote:
On Fri, Jul 25, 2014 at 5:56 PM, RayS <rays@bluecove.com> wrote:
The important point was that it would be best if all of the methods affected by summing 32 bit floats with 32 bit accumulators had the same Notes as numpy.mean(). We went through a lot of code yesterday, assuming that any numpy or Scipy.stats functions that use accumulators suffer the same issue, whether noted or not, and found it true.
Do you have a list of the functions that are affected?
We only tested a few we used, but scipy.stats.nanmean, or any .*mean() numpy.sum, mean, average, std, var,... via something like: import numpy import scipy.stats print numpy.__version__ print scipy.__version__ onez = numpy.ones((2**25, 1), numpy.float32) step = 2**10 func = scipy.stats.nanmean for s in range(2**24step, 2**25, step): if func(onez[:s+step])!=1.: print '\nbroke', s, func(onez[:s+step]) break else: print '\r',s,
That said, it does seem that np.mean could be implemented better than it is, even given float32's inherent limitations. If anyone wants to implement better algorithms for computing the mean, variance, sums, etc., then we would love to add them to numpy.
Others have pointed out the possible tradeoffs in summation algos, perhaps a method arg would be appropriate, "better" depending on your desire for speed vs. accuracy. It just occurred to me that if the STSI folks (who count photons) took the mean() or other such func of an image array from Hubble sensors to find background value, they'd better always be using float64.  Ray
On Fri, Jul 25, 2014 at 4:25 PM, RayS <rays@bluecove.com> wrote:
At 11:29 AM 7/25/2014, you wrote:
On Fri, Jul 25, 2014 at 5:56 PM, RayS <rays@bluecove.com> wrote:
The important point was that it would be best if all of the methods affected by summing 32 bit floats with 32 bit accumulators had the same Notes as numpy.mean(). We went through a lot of code yesterday, assuming that any numpy or Scipy.stats functions that use accumulators suffer the same issue, whether noted or not, and found it true.
Do you have a list of the functions that are affected?
We only tested a few we used, but scipy.stats.nanmean, or any .*mean() numpy.sum, mean, average, std, var,...
via something like:
import numpy import scipy.stats print numpy.__version__ print scipy.__version__ onez = numpy.ones((2**25, 1), numpy.float32) step = 2**10 func = scipy.stats.nanmean for s in range(2**24step, 2**25, step): if func(onez[:s+step])!=1.: print '\nbroke', s, func(onez[:s+step]) break else: print '\r',s,
That said, it does seem that np.mean could be implemented better than it is, even given float32's inherent limitations. If anyone wants to implement better algorithms for computing the mean, variance, sums, etc., then we would love to add them to numpy.
Others have pointed out the possible tradeoffs in summation algos, perhaps a method arg would be appropriate, "better" depending on your desire for speed vs. accuracy.
I think this would be a good improvement. But it doesn't compensate for users to be aware of the problems. I think the docstring and the description of the dtype argument is pretty clear. I'm largely with Eelco, whatever precision or algorithm we use, with floating point calculations we run into problems in some cases. And I don't think this is a "broken" function but a design decision that takes the different tradeoffs into account. Whether it's the right decision is an open question, if there are better algorithm with essentially not disadvantages. Two examples: I had problems to verify some results against Stata at more than a few significant digits, until I realized that Stata had used float32 for the calculations by default in this case, while I was working with float64. Using single precision linear algebra causes the same numerical problems as numpy.mean runs into. A few years ago I tried to match some tougher NIST examples that were intentionally very badly scaled. numpy.mean at float64 had quite large errors, but a simple trick with two passes through the data managed to get very close to the certified NIST examples. my conclusion: don't use float32 unless you know you don't need any higher precision. even with float64 it is possible to run into extreme cases where you get numerical garbage or large precision losses. However, in the large majority of cases a boring fast "naive" implementation is enough. Also, whether we use mean, sum or dot in a calculation is an implementation detail, which in the case of dot doesn't have a dtype argument and always depends on the dtype of the arrays, AFAIK. Separating the accumulation dtype from the array dtype would require a lot of work except in the simplest cases, like those that only use sum and mean with specified dtype argument. trying out the original example:
X = np.ones((50000, 1024), np.float32) X.mean() 1.0 X.mean(dtype=np.float32) 1.0
np.dot(X.ravel(), np.ones(X.ravel().shape) *1. / X.ravel().shape) 1.0000000002299174
np.__version__ '1.5.1'
Win32 Josef
It just occurred to me that if the STSI folks (who count photons) took the mean() or other such func of an image array from Hubble sensors to find background value, they'd better always be using float64.
 Ray
_______________________________________________ NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
participants (2)

josef.pktd＠gmail.com

RayS