[Python-ideas] Numerical instability was: Re: Introduce collections.Reiterable
Terry Reedy
tjreedy at udel.edu
Fri Sep 20 22:01:28 CEST 2013
On 9/20/2013 6:45 AM, Oscar Benjamin wrote:
> I'm going to assume that numerical instability is just a way of saying
> that a method is inaccurate in some cases.
Good enough for me.
> Although the incremental algorithm is much better than the naive
> approach Steven (knowingly) showed above I don't think it's true that
> constraining yourself to a single pass doesn't limit the possible
> accuracy.
That may be one difference between integer and float arithmetic: with
floats, the order of operations makes a difference to the result.
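A two-line illustration of the point: float addition is not associative,
so regrouping the same three addends can change the result, while
integer addition has no such problem.

```python
# Float addition is not associative: grouping changes the rounding,
# so the same three addends can give two different sums.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
print(left == right)       # False

# Integer addition is exact, so grouping never matters.
print((1 + 2) + 3 == 1 + (2 + 3))  # True
```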
> Another point of relevance here is that the incremental
> formula cannot be as efficiently implemented in Python since you don't
> get to take advantage of the super fast math.fsum function which is
> also more accurate than a naive Kahan algorithm.
Yes. One of the differences between 'theoretical' algorithms and
practical algorithms coded in CPython is the bias toward using functions
already coded in C.
> The script at the bottom of this post tests a few methods on a
> deliberately awkward set of random numbers and typical output is:
Thanks for doing this.
> $ ./stats.py
> exact: 0.989661716301
> naive -> error = -21476.0408922
> incremental -> error = -1.0770901604e-07
> two_pass -> error = 1.29118937764e-13
> three_pass -> error = 0.0
The incremental method is good enough for data measured to 3 significant
figures, as is typical in at least parts of some sciences, and in the
data I worked with for a decade. But it is not good enough for
substantially more accurate data. The Python statistics module should
cater to the latter. The doc should just say that it requires a serially
re-iterable input. (A person with data too large to fit in memory could
write an iterable that opens a file and returns an iterator that reads
blocks of values and yields them one at a time.)
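A minimal sketch of such a serially re-iterable wrapper (the class name
and one-value-per-line file format are hypothetical, and it reads line
by line rather than in larger blocks, for simplicity):

```python
# Hypothetical sketch: a serially re-iterable source of float values,
# stored one per line in a text file. Each call to __iter__ reopens
# the file, so a multi-pass algorithm can traverse the data repeatedly
# without it ever residing wholly in memory.
class FileValues:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield float(line)
```

An instance can then be passed anywhere a re-iterable of numbers is
expected; iterating it a second time yields the same values again.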
The incremental method is useful for returning running means and
deviations for data collected sporadically and indefinitely, without
needing to store the cumulative data. It is a nice, non-obvious example
of the principle that it is sometimes possible to summarize cumulative
data with a relatively small and fixed set of sufficient statistics.
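In this case the fixed-size triple (n, mean, M2) is that set of
sufficient statistics: values can arrive sporadically and be discarded,
yet the running mean and variance are always available. A minimal
sketch of the incremental method packaged that way (the class and
method names are mine):

```python
# The triple (n, mean, M2) summarizes everything needed for the
# running mean and variance; individual values need not be stored.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0   # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance (n - 1 denominator), valid once n >= 2.
        return self.M2 / (self.n - 1)
```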
> For these numbers the three_pass method usually has an error of 0 but
> otherwise 1ulp (1e-16). (It can actually be collapsed into a two pass
> method but then we couldn't use fsum.)
>
> If you know of a one-pass algorithm (or a way to improve the
> implementation I showed) that is as accurate as either the two_pass or
> three_pass methods I'd be very interested to see it (I'm sure Steven
> would be as well).
If I were trying to improve the incremental variance algorithm, I would
study the fsum method until I really understood it and then see if I
could apply the same ideas.
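For reference, a sketch of the plain Kahan scheme that fsum improves
upon: each addition carries a correction term holding the low-order
bits that a bare float sum would drop. (This is the 'naive Kahan'
Oscar mentions, shown only for comparison, not a replacement for fsum.)

```python
# Kahan compensated summation: track the rounding error of each
# addition in a running correction term and fold it back in.
def kahan_sum(values):
    total = 0.0
    c = 0.0                   # compensation for lost low-order bits
    for x in values:
        y = x - c             # apply the pending correction
        t = total + y         # big + small: low bits of y may be lost
        c = (t - total) - y   # algebraically 0; captures what was lost
        total = t
    return total
```

The error of this scheme stays bounded by a few ulps regardless of the
number of addends, whereas a bare running sum's error can grow with n.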
>
>
> Oscar
>
>
> $ cat stats.py
> #!/usr/bin/env python
>
> from __future__ import print_function
>
> from random import gauss
> from math import fsum
> from fractions import Fraction
>
> # Return the exact result as a Fraction. Nothing wrong
> # with using the computational formula for variance here.
> def variance_exact(data):
>     data = [Fraction(x) for x in data]
>     n = len(data)
>     sumx = sum(data)
>     sumx2 = sum(x**2 for x in data)
>     ss = sumx2 - (sumx**2)/n
>     return ss/(n-1)
>
> # Although this is the most efficient formula when using
> # exact computation it fails under fixed precision
> # floating point since it ends up subtracting two large
> # almost equal numbers leading to a catastrophic loss of
> # precision.
> def variance_naive(data):
>     n = len(data)
>     sumx = fsum(data)
>     sumx2 = fsum(x**2 for x in data)
>     ss = sumx2 - (sumx**2)/n
>     return ss/(n-1)
>
> # Incremental variance calculation from Wikipedia. If
> # the above uses fsum then a fair comparison should
> # use some compensated summation here also. However
> # it's not clear (to me) how to incorporate compensated
> # summation here.
> # http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Compensated_variant
> def variance_incremental(data):
>     n = 0
>     mean = 0
>     M2 = 0
>
>     for x in data:
>         n = n + 1
>         delta = x - mean
>         mean = mean + delta/n
>         M2 = M2 + delta*(x - mean)
>
>     variance = M2/(n - 1)
>     return variance
>
> # This is to me the obvious formula since I think of
> # this as the definition of the variance.
> def variance_twopass(data):
>     n = len(data)
>     mean = fsum(data) / n
>     sumdev2 = fsum((x - mean)**2 for x in data)
>     variance = sumdev2 / (n - 1)
>     return variance
>
>
> # This is the three-pass algorithm used in Steven's
> # statistics module. It's not one I had seen before but
> # AFAICT it's very accurate. In fact the 2nd and 3rd passes
> # can be merged as in variance_incremental but then we
> # wouldn't be able to take advantage of fsum.
> def variance_threepass(data):
>     n = len(data)
>     mean = fsum(data) / n
>     sumdev2 = fsum((x-mean)**2 for x in data)
>     # The following sum should mathematically equal zero, but due to
>     # rounding error may not.
>     sumdev2 -= fsum((x-mean) for x in data)**2 / n
>     return sumdev2 / (n - 1)
>
> methods = [
>     ('naive', variance_naive),
>     ('incremental', variance_incremental),
>     ('two_pass', variance_twopass),
>     ('three_pass', variance_threepass),
> ]
>
> # Test numbers with large mean and small standard deviation.
> # This is the case that causes trouble for the naive formula.
> N = 100000
> testnums = [gauss(mu=10**10, sigma=1) for n in range(N)]
>
> # First compute the exact result
> exact = variance_incremental([Fraction(num) for num in testnums])
> print('exact:', float(exact))
>
> # Compare each with the exact result
> for name, var in methods:
>     print(name, '-> error =', var(testnums) - exact)
--
Terry Jan Reedy