[Numpy-discussion] Bug in numpy.mean() revisited

Charles R Harris charlesr.harris at gmail.com
Fri Jul 27 00:15:14 EDT 2012


On Thu, Jul 26, 2012 at 9:26 PM, Tom Aldcroft <aldcroft at head.cfa.harvard.edu> wrote:

> There was a thread in January discussing the non-obvious behavior of
> numpy.mean() for large arrays of float32 values [1].  This issue is
> nicely discussed at the end of the numpy.mean() documentation [2] with
> an example:
>
> >>> import numpy as np
> >>> a = np.zeros((2, 512*512), dtype=np.float32)
> >>> a[0, :] = 1.0
> >>> a[1, :] = 0.1
> >>> np.mean(a)
> 0.546875
>
> From the docs and previous discussion it seems there is no technical
> difficulty in choosing a different (higher precision) type for the
> accumulator using the dtype arg, and in fact this is done
> automatically for int values.
>
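
For reference, requesting a 64-bit accumulator via the dtype argument on the
same array gives the expected result; the digits below match the np.mean()
docstring example and may vary slightly in the last places by platform:

>>> np.mean(a, dtype=np.float64)   # accumulate in float64
0.55000000074505806
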
> My question is whether there would be any support for doing something
> more than documenting this behavior.  I suspect very few people ever
> make it below the fold for the np.mean() documentation.  Taking the
> mean of large arrays of float32 values is a *very* common use case and
> giving the wrong answer with default inputs is really disturbing.  I
> recently had to rebuild a complex science data archive because of
> corrupted mean values.
>
> Possible ideas to stimulate discussion:
> 1. Always use float64 to accumulate float types that are 64 bits or
> less.   Are there serious performance impacts to automatically using
> float64 to accumulate float32 arrays?  I appreciate this would likely
> introduce unwanted regressions (sometimes suddenly getting the right
> answer is a bad thing).  So could this be considered for numpy 2.0?
>
> 2. Might there be a way to emit a warning if the number of values and
> the max accumulated value [3] are such that the estimated fractional
> error is above some tolerance?  I'm not even sure if this is a good
> idea or if there will be howls from the community as their codes start
> warning about inaccurate mean values.  Better idea along this line??
>
>
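
A rough sketch of the kind of check described in idea 2 above, simplified to
look only at the element count: for naive summation the worst-case relative
error grows roughly like n * eps, where eps is the machine epsilon of the
accumulator dtype. The function name and tolerance below are illustrative
only, not an existing NumPy API:

import numpy as np
import warnings

def warn_if_mean_may_lose_precision(a, tol=1e-5):
    # Heuristic only, for floating-point input: estimate the worst-case
    # relative error of naively accumulating a.size values in a.dtype.
    eps = np.finfo(a.dtype).eps
    est_rel_err = a.size * eps
    if est_rel_err > tol:
        warnings.warn(
            "mean over %d %s values may be inaccurate "
            "(estimated relative error ~%.1e); consider dtype=np.float64"
            % (a.size, a.dtype, est_rel_err))
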
I would support accumulating in 64 bits but, IIRC, the function would need
to be rewritten so that it adds the 32-bit floats directly to a 64-bit
accumulator rather than first casting the whole array, in order to save
space. There are also more stable summation methods that could be
investigated. There is a nice little project there for someone to cut their
teeth on.

Chuck
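
One concrete example of the "more stable methods" mentioned above is pairwise
(cascade) summation, whose worst-case error grows like O(log n) rather than
O(n) even with the accumulator kept in float32. A pure-Python sketch, purely
illustrative (and slow), not how np.mean is actually implemented:

import numpy as np

def pairwise_sum_float32(x):
    # Recursively split the data and add the two half-sums; the
    # accumulator stays float32 throughout.
    n = len(x)
    if n <= 8:
        # Base case: naive left-to-right summation of a small block.
        s = np.float32(0.0)
        for v in x:
            s = s + v   # float32 + float32 stays float32
        return s
    mid = n // 2
    return pairwise_sum_float32(x[:mid]) + pairwise_sum_float32(x[mid:])

On the docstring example above, pairwise_sum_float32(a.ravel()) / a.size
comes out at roughly 0.55, whereas the naive float32 accumulation gives
0.546875.
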