[Python-ideas] Numerical instability was: Re: Introduce collections.Reiterable

Oscar Benjamin oscar.j.benjamin at gmail.com
Fri Sep 20 12:45:33 CEST 2013


On 19 September 2013 22:25, Terry Reedy <tjreedy at udel.edu> wrote:
> On 9/19/2013 8:28 AM, Steven D'Aprano wrote:
>>
>> def variance(data):
>>      # Don't do this.
>>      sumx = sum(data)
>>      sumx2 = sum(x**2 for x in data)
>>      ss = sumx2 - (sumx**2)/n
>>      return ss/(n-1)
>>
>> Ignore the fact that this algorithm is numerically unstable.
>
> Let's not ;-)
>
>> It fails
>> for iterator arguments, because sum(data) consumes the iterator and
>> leaves sumx2 always equal to zero.
>
> This is doubly bad design because the two 'hidden' loops are trivially
> jammed together in one explicit loop, while use of Reiterable would not
> remove the numerical instability. While it may seem that a numerically
> stable solution needs two loops (the second to sum (x-mean)**2), the two
> loops can still be jammed together with the Method of Provisional Means.
>
> http://www.stat.wisc.edu/~larget/math496/mean-var.html
> http://www.statistical-solutions-software.com/BMDP-documents/BMDP-Formula1.pdf
>
> Also called 'online algorithm' and 'Weighted incremental algorithm' in
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
>
> This was invented and used back when re-iteration of large datasets (on
> cards or tape) was possible but very slow (1970s or before). (Restack or
> rewind and reread might triple the (expensive) run time.)

I'm never quite sure exactly what is meant by "numerical instability"
in most contexts, because I'm mainly familiar with the term from ODE
solvers, where an unstable solver will literally diverge from the true
solution as if it were an unstable equilibrium. However, in that
context the error is truncation error rather than rounding error, and
it would occur even with infinite-precision arithmetic:
http://en.wikipedia.org/wiki/Stiff_equation

I'm going to assume that numerical instability is just a way of saying
that a method is inaccurate in some cases.

Although the incremental algorithm is much better than the naive
approach Steven (knowingly) showed above, I don't think it's true that
constraining yourself to a single pass costs nothing in accuracy.
Another relevant point is that the incremental formula cannot be
implemented as efficiently in Python, since you don't get to take
advantage of the very fast math.fsum function, which is also more
accurate than naive Kahan summation.
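
To see the difference fsum makes, here's a quick illustration (not
from the timing discussion, but easy to check at the interpreter):

>>> from math import fsum
>>> sum([0.1] * 10)
0.9999999999999999
>>> fsum([0.1] * 10)
1.0

fsum tracks all of the intermediate rounding error internally and
returns the correctly rounded sum, whereas Kahan summation only
compensates for most of it.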

The script at the bottom of this post tests a few methods on a
deliberately awkward set of random numbers; typical output is:

$ ./stats.py
exact: 0.989661716301
naive -> error = -21476.0408922
incremental -> error = -1.0770901604e-07
two_pass -> error = 1.29118937764e-13
three_pass -> error = 0.0

For these numbers the three_pass method usually has an error of 0,
and otherwise of 1 ulp (about 1e-16). (It can actually be collapsed
into a two-pass method, but then we couldn't use fsum.)
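
Roughly like this, I think (my sketch, not the code from Steven's
module):

from math import fsum

def variance_collapsed(data):
    # First pass: the mean, using fsum.
    n = len(data)
    mean = fsum(data) / n
    # Second pass: accumulate the squared deviations and the
    # (theoretically zero) sum of deviations in a single loop.
    sumdev = sumdev2 = 0.0
    for x in data:
        dev = x - mean
        sumdev += dev
        sumdev2 += dev**2
    # Same correction term as the three-pass version below.
    return (sumdev2 - sumdev**2/n) / (n - 1)

but the accumulation now happens with ordinary float additions, so
presumably some accuracy is given up relative to the fsum-based
version.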

If you know of a one-pass algorithm (or a way to improve the
implementation I showed) that is as accurate as either the two_pass or
three_pass methods, I'd be very interested to see it (I'm sure Steven
would be as well).


Oscar


$ cat stats.py
#!/usr/bin/env python

from __future__ import print_function

from random import gauss
from math import fsum
from fractions import Fraction

# Return the exact result as a Fraction. Nothing wrong
# with using the computational formula for variance here.
def variance_exact(data):
    data = [Fraction(x) for x in data]
    n = len(data)
    sumx = sum(data)
    sumx2 = sum(x**2 for x in data)
    ss = sumx2 - (sumx**2)/n
    return ss/(n-1)

# Although this is the most efficient formula when using
# exact computation, it fails under fixed-precision
# floating point, since it ends up subtracting two large,
# almost equal numbers, leading to a catastrophic loss of
# precision.
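# With the test numbers below (mu=1e10, n=1e5), sumx2 and
# (sumx**2)/n are both around 1e25 while their difference ss is
# only around 1e5, far below 1 ulp of 1e25 (about 2e9), so
# essentially every significant digit is lost.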
def variance_naive(data):
    n = len(data)
    sumx = fsum(data)
    sumx2 = fsum(x**2 for x in data)
    ss = sumx2 - (sumx**2)/n
    return ss/(n-1)

# Incremental variance calculation from Wikipedia. If
# the above uses fsum then a fair comparison should
# use some compensated summation here also. However
# it's not clear (to me) how to incorporate compensated
# summation here.
# http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Compensated_variant
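# (The update below is Welford's method, i.e. the provisional
# means approach Terry mentioned:
#     mean_k = mean_{k-1} + (x_k - mean_{k-1})/k
#     M2_k   = M2_{k-1} + (x_k - mean_{k-1})*(x_k - mean_k)
# with variance = M2_n/(n - 1).)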
def variance_incremental(data):
    n = 0
    mean = 0
    M2 = 0

    for x in data:
        n = n + 1
        delta = x - mean
        mean = mean + delta/n
        M2 = M2 + delta*(x - mean)

    variance = M2/(n - 1)
    return variance

# This is to me the obvious formula since I think of
# this as the definition of the variance.
def variance_twopass(data):
    n = len(data)
    mean = fsum(data) / n
    sumdev2 = fsum((x - mean)**2 for x in data)
    variance = sumdev2 / (n - 1)
    return variance


# This is the three-pass algorithm used in Steven's
# statistics module. It's not one I had seen before but
# AFAICT it's very accurate. In fact the 2nd and 3rd passes
# can be merged as in variance_incremental but then we
# wouldn't be able to take advantage of fsum.
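# (It's the "compensated variant" of the two-pass algorithm from
# the Wikipedia page linked above; the extra subtraction corrects
# for rounding error in the computed mean.)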
def variance_threepass(data):
    n = len(data)
    mean = fsum(data) / n
    sumdev2 = fsum((x-mean)**2 for x in data)
    # The following sum should mathematically equal zero, but due to rounding
    # error may not.
    sumdev2 -= fsum((x-mean) for x in data)**2 / n
    return sumdev2 / (n - 1)

methods = [
    ('naive', variance_naive),
    ('incremental', variance_incremental),
    ('two_pass', variance_twopass),
    ('three_pass', variance_threepass),
]

# Test numbers with large mean and small standard deviation.
# This is the case that causes trouble for the naive formula.
N = 100000
testnums = [gauss(mu=10**10, sigma=1) for _ in range(N)]

# First compute the exact result using Fraction arithmetic
# (variance_exact converts its input to Fractions itself)
exact = variance_exact(testnums)
print('exact:', float(exact))

# Compare each with the exact result
for name, var in methods:
    print(name, '-> error =', var(testnums) - exact)

