[Python-Dev] Accumulation module

Tim Peters tim.one at comcast.net
Fri Jan 16 23:21:10 EST 2004


[Raymond Hettinger, on using Knuth's recurrence for variance]
> I found it.
>
> There is a suggestion that the formula would be better off
> if the recurrence sums are accumulated along with a cumulative
> error term that tracks the loss of significance when adding
> a small adjustment to a larger sum.  The textbook exercise
> indicates that that technique is the most useful in the presence
> of crummy rounding.  Do you think there is any value in using
> the error term tracking?

The tension between speed and accuracy in floating-point numerics bites on
every operation.  Tricks like Kahan summation can be valuable, but they can
also so grossly complicate the code as to introduce subtle bugs of their own,
and can also lull you into believing they deliver more than they do.  For a
discussion of numeric *problems* with Kahan's gimmick, and yet another layer
of complication to overcome them (well, to reduce them to a milder level of
weakness):

    http://www.owlnet.rice.edu/~caam420/lectures/fpsum.html
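For concreteness, plain Kahan (compensated) summation -- the gimmick in
question -- can be sketched like so (this is the textbook algorithm, not code
taken from that page):

```python
def kahan_sum(xs):
    # Compensated (Kahan) summation:  carry along the rounding error of
    # each addition in c, and fold it back into the next addend.
    total = 0.0
    c = 0.0                    # running compensation for lost low-order bits
    for x in xs:
        y = x - c              # apply the correction from the last step
        t = total + y
        c = (t - total) - y    # algebraically 0; captures the rounding error
        total = t
    return total
```

The extra subtract-and-compare work roughly quadruples the cost of the inner
loop, which is part of the speed-vs-accuracy tension above.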

The numeric problems for mean and variance are worst when the inputs vary in
sign and span a large dynamic range.  But that's unusual in real-life data.
For example, computing the mean height of American adults involves
all-positive numbers that don't even span a factor of 10.

As a bad example, consider this vector:

    x = [1e10]*10 + [21] + [-1e10]*10

There are 21 values total, the first and last 10 cancel out, so the true
mean is 21/21 = 1.  Here are two ways to compute the mean:

def mean1(x):
    # add everything up, then divide once
    return sum(x) / float(len(x))

def mean2(x):
    # running mean:  nudge m by a small correction for each element
    m = 0.0
    for i, elt in enumerate(x):
        m += (elt - m) / (i + 1.0)
    return m

Then

>>> mean1(x)
1.0
>>> mean2(x)
0.99999988079071045
>>>

mean2 is relatively atrocious, and you'll recognize it now as Knuth's
recurrence for the mean.  sum(x) is actually exact with 754 double
arithmetic (we're not even close to using up 53 bits), and the 9 orders of
magnitude spanned by the input elements is much larger than seen in most
real-life sources of data.  The problem with mean2 here is subtler than just
that:  its intermediate running means sit near 1e10 for most of the loop, so
each small correction loses low-order bits, and those absolute errors survive
the massive cancellation down to a final mean near 1.

So the one thing I absolutely recommend with Knuth's method is keeping a
running sum, computing the mean fresh on each iteration, instead of using
the mean recurrence.  If that's done, Knuth's recurrence for the sum of
(x_i-mean)**2 works very well even for the "bad example" above.

> ...
> Without the error term tracking, the inner loop simplifies to:
>
>         k += 1
>         newm = m+(x-m)/k
>         s += (x-m)*(x-newm)
>         m = newm
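[Wrapped into a complete function -- my scaffolding, not Raymond's code --
that inner loop reads:

```python
def knuth_var_sum(xs):
    # Knuth's recurrence:  s accumulates sum((x_i - mean)**2), with the
    # running mean m updated by the mean recurrence on each element.
    k = 0
    m = 0.0   # running mean
    s = 0.0   # running sum of squared deviations from the mean
    for x in xs:
        k += 1
        newm = m + (x - m) / k
        s += (x - m) * (x - newm)
        m = newm
    return s  # divide by k-1 (sample) or k (population) for the variance
```
]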

I hope you worked out the derivation for s!  It's astonishing that such a
complicated update could collapse to such a simple expression.  The
suggestion above amounts to replacing

        newm = m+(x-m)/k

by

        sum += x
        newm = sum / k
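In full, the variant with a running sum (a sketch, reusing the names above)
looks like:

```python
def knuth_var_sum2(xs):
    # Same as Knuth's recurrence for s, but the mean is recomputed fresh
    # from a running sum on each iteration, as recommended above.
    total = 0.0
    m = 0.0
    s = 0.0
    for k, x in enumerate(xs, 1):
        total += x
        newm = total / k
        s += (x - m) * (x - newm)
        m = newm
    return s
```

On the bad example above, total is exact, so the final mean comes out as
exactly 1.0, and s lands close to the true value of 2e21 + 420.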



