[Numpy-discussion] numpy.mean still broken for largefloat32arrays

Fri Jul 25 17:36:27 EDT 2014

On Fri, Jul 25, 2014 at 4:25 PM, RayS <rays at blue-cove.com> wrote:

> At 11:29 AM 7/25/2014, you wrote:
> >On Fri, Jul 25, 2014 at 5:56 PM, RayS <rays at blue-cove.com> wrote:
> > > The important point was that it would be best if all of the
> > methods affected
> > > by summing 32 bit floats with 32 bit accumulators had the same Notes as
> > > numpy.mean(). We went through a lot of code yesterday, assuming that
> any
> > > numpy or Scipy.stats functions that use accumulators suffer the same
> issue,
> > > whether noted or not, and found it true.
> >
> >Do you have a list of the functions that are affected?
>
> We only tested a few we used, but
> scipy.stats.nanmean, or any .*mean()
> numpy.sum, mean, average, std, var,...
>
> via something like:
>
> import numpy
> import scipy.stats
> print numpy.__version__
> print scipy.__version__
> onez = numpy.ones((2**25, 1), numpy.float32)
> step = 2**10
> func = scipy.stats.nanmean
> for s in range(2**24-step, 2**25, step):
>      if func(onez[:s+step])!=1.:
>          print '\nbroke', s, func(onez[:s+step])
>          break
>      else:
>          print '\r',s,
>
> >  That said, it does seem that np.mean could be implemented better than
> >it is, even given float32's inherent limitations. If anyone wants to
> >implement better algorithms for computing the mean, variance, sums,
> >etc., then we would love to add them to numpy.
>
> Others have pointed out the possible tradeoffs in summation algos,
> perhaps a method arg would be appropriate, "better" depending on your
> desire for speed vs. accuracy.
>

I think this would be a good improvement.
But it doesn't compensate for users to be aware of the problems. I think
the docstring and the description of the dtype argument is pretty clear.

I'm largely with Eelco, whatever precision or algorithm we use, with
floating point calculations we run into problems in some cases. And I don't
think this is a "broken" function but a design decision that takes the
different tradeoffs into account.
Whether it's the right decision is an open question, if there are better
algorithm with essentially not disadvantages.

Two examples:
I had problems to verify some results against Stata at more than a few
significant digits, until I realized that Stata had used float32 for the
calculations by default in this case, while I was working with float64.
Using single precision linear algebra causes the same numerical problems as
numpy.mean runs into.

A few years ago I tried to match some tougher NIST examples that were
intentionally very badly scaled. numpy.mean at float64 had quite large
errors, but a simple trick with two passes through the data managed to get
very close to the certified NIST examples.

my conclusion:
don't use float32 unless you know you don't need any higher precision.
even with float64 it is possible to run into extreme cases where you get
numerical garbage or large precision losses.
However, in the large majority of cases a boring fast "naive"
implementation is enough.

Also, whether we use mean, sum or dot in a calculation is an implementation
detail, which in the case of dot doesn't have a dtype argument and always
depends on the dtype of the arrays, AFAIK.
Separating the accumulation dtype from the array dtype would require a lot
of work except in the simplest cases, like those that only use sum and mean
with specified dtype argument.

trying out the original example:

>>> X = np.ones((50000, 1024), np.float32)
>>> X.mean()
1.0
>>> X.mean(dtype=np.float32)
1.0

>>> np.dot(X.ravel(), np.ones(X.ravel().shape) *1. / X.ravel().shape)
1.0000000002299174

>>> np.__version__
'1.5.1'

Win32

Josef

>
> It just occurred to me that if the STSI folks (who count photons)
> took the mean() or other such func of an image array from Hubble
> sensors to find background value, they'd better always be using float64.
>
>   - Ray
>
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20140725/496b54f0/attachment.html>