Re: [Numpy-discussion] please change mean to use dtype=float

20 Sep 2006


      David M. Cooke wrote:
...
On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
...
Let me offer a third path: the algorithms used for .mean() and .var() are 
substandard. There are much better incremental algorithms that entirely avoid 
the need to accumulate such large (and therefore precision-losing) intermediate 
values. The algorithms look like the following for 1D arrays in Python:
def mean(a):
     m = a[0]
     for i in range(1, len(a)):
         m += (a[i] - m) / (i + 1)
     return m
This isn't really going to be any better than using a simple sum.
It'll also be slower (a division per iteration).
With one exception, every test that I've thrown at it shows that it's better for 
float32. That exception is uniformly spaced arrays, like linspace().
...
You do avoid
accumulating large sums, but then doing the division a[i]/len(a) and
adding that will do the same.
Okay, this is true.
...
Now, if you want to avoid losing precision, you want to use a better
summation technique, like compensated (or Kahan) summation:
def mean(a):
    s = e = a.dtype.type(0)
    for i in range(0, len(a)):
        temp = s
        y = a[i] + e
        s = temp + y
        e = (temp - s) + y
    return s / len(a)
Some numerical experiments in Maple using 5-digit precision show that
your mean is maybe a bit better in some cases, but can also be much
worse, than sum(a)/len(a), but both are quite poor in comparision to the
Kahan summation.
(We could probably use a fast implementation of Kahan summation in
addition to a.sum())
+1
...
...
def var(a):
     m = a[0]
     t = a.dtype.type(0)
     for i in range(1, len(a)):
         q = a[i] - m
         r = q / (i+1)
         m += r
         t += i * q * r
     t /= len(a)
     return t
Alternatively, from Knuth:
def var_knuth(a):
     m = a.dtype.type(0)
     variance = a.dtype.type(0)
     for i in range(len(a)):
         delta = a[i] - m
         m += delta / (i+1)
         variance += delta * (a[i] - m)
     variance /= len(a)
     return variance
These formulas are good when you can only do one pass over the data
(like in a calculator where you don't store all the data points), but
are slightly worse than doing two passes. Kahan summation would probably
also be good here too.
Again, my tests show otherwise for float32. I'll condense my ipython log into a 
module for everyone's perusal. It's possible that the Kahan summation of the 
squared residuals will work better than the current two-pass algorithm and the 
implementations I give above.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco