[Python-ideas] Numerical instability was: Re: Introduce collections.Reiterable
Terry Reedy
tjreedy at udel.edu
Fri Sep 20 22:01:28 CEST 2013
On 9/20/2013 6:45 AM, Oscar Benjamin wrote:
> I'm going to assume that numerical instability is just a way of saying
> that a method is inaccurate in some cases.
Good enough for me.
> Although the incremental algorithm is much better than the naive
> approach Steven (knowingly) showed above I don't think it's true that
> constraining yourself to a single pass doesn't limit the possible
> accuracy.
That may be one difference between integer and float arithmetic: with
floats, the order of operations makes a difference to the result.
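A two-line illustration of the point: float addition is not associative,
so regrouping the same three addends can change the result, while
integer addition has no such problem.

```python
# Float addition is not associative: grouping changes the rounding,
# so the same three addends can give two different sums.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
print(left == right)       # False

# Integer addition is exact, so grouping never matters.
print((1 + 2) + 3 == 1 + (2 + 3))  # True
```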
> Another point of relevance here is that the incremental
> formula cannot be as efficiently implemented in Python since you don't
> get to take advantage of the super fast math.fsum function which is
> also more accurate than a naive Kahan algorithm.
Yes. One of the differences between 'theoretical' algorithms and
practical algorithms coded in CPython is the bias toward using functions
already coded in C.
> The script at the bottom of this post tests a few methods on a
> deliberately awkward set of random numbers and typical output is:
Thanks for doing this.
> $ ./stats.py
> exact: 0.989661716301
> naive -> error = -21476.0408922
> incremental -> error = -1.0770901604e-07
> two_pass -> error = 1.29118937764e-13
> three_pass -> error = 0.0
The incremental method is good enough for data measured to 3 significant
figures, as is typical in at least parts of some sciences, and in the
data I worked with for a decade. But it is not good enough for
substantially more accurate data. The Python statistics module should
cater to the latter. The doc should just say that it requires a serially
re-iterable input. (A person with data too large to fit in memory could
write an iterable that opens a file and returns an iterator that reads
blocks of values and yields them one at a time.)
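A minimal sketch of such a serially re-iterable wrapper (the class name
and one-value-per-line file format are hypothetical, and it reads line
by line rather than in larger blocks, for simplicity):

```python
# Hypothetical sketch: a serially re-iterable source of float values,
# stored one per line in a text file. Each call to __iter__ reopens
# the file, so a multi-pass algorithm can traverse the data repeatedly
# without it ever residing wholly in memory.
class FileValues:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield float(line)
```

An instance can then be passed anywhere a re-iterable of numbers is
expected; iterating it a second time yields the same values again.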
The incremental method is useful for returning running means and
deviations for data collected sporadically and indefinitely, without
needing to store the cumulative data. It is a nice, non-obvious example
of the principle that it is sometimes possible to summarize cumulative
data with a relatively small and fixed set of sufficient statistics.
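In this case the fixed-size triple (n, mean, M2) is that set of
sufficient statistics: values can arrive sporadically and be discarded,
yet the running mean and variance are always available. A minimal
sketch of the incremental method packaged that way (the class and
method names are mine):

```python
# The triple (n, mean, M2) summarizes everything needed for the
# running mean and variance; individual values need not be stored.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0   # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance (n - 1 denominator), valid once n >= 2.
        return self.M2 / (self.n - 1)
```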
> For these numbers the three_pass method usually has an error of 0 but
> otherwise 1ulp (1e-16). (It can actually be collapsed into a two pass
> method but then we couldn't use fsum.)
>
> If you know of a one-pass algorithm (or a way to improve the
> implementation I showed) that is as accurate as either the two_pass or
> three_pass methods I'd be very interested to see it (I'm sure Steven
> would be as well).
If I were trying to improve the incremental variance algorithm, I would
study the fsum method until I really understood it and then see if I
could apply the same ideas.
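For reference, a sketch of the plain Kahan scheme that fsum improves
upon: each addition carries a correction term holding the low-order
bits that a bare float sum would drop. (This is the 'naive Kahan'
Oscar mentions, shown only for comparison, not a replacement for fsum.)

```python
# Kahan compensated summation: track the rounding error of each
# addition in a running correction term and fold it back in.
def kahan_sum(values):
    total = 0.0
    c = 0.0                   # compensation for lost low-order bits
    for x in values:
        y = x - c             # apply the pending correction
        t = total + y         # big + small: low bits of y may be lost
        c = (t - total) - y   # algebraically 0; captures what was lost
        total = t
    return total
```

The error of this scheme stays bounded by a few ulps regardless of the
number of addends, whereas a bare running sum's error can grow with n.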
>
>
> Oscar
>
>
> $ cat stats.py
> #!/usr/bin/env python
>
> from __future__ import print_function
>
> from random import gauss
> from math import fsum
> from fractions import Fraction
>
> # Return the exact result as a Fraction. Nothing wrong
> # with using the computational formula for variance here.
> def variance_exact(data):
>     data = [Fraction(x) for x in data]
>     n = len(data)
>     sumx = sum(data)
>     sumx2 = sum(x**2 for x in data)
>     ss = sumx2 - (sumx**2)/n
>     return ss/(n-1)
>
> # Although this is the most efficient formula when using
> # exact computation it fails under fixed precision
> # floating point since it ends up subtracting two large
> # almost equal numbers leading to a catastrophic loss of
> # precision.
> def variance_naive(data):
>     n = len(data)
>     sumx = fsum(data)
>     sumx2 = fsum(x**2 for x in data)
>     ss = sumx2 - (sumx**2)/n
>     return ss/(n-1)
>
> # Incremental variance calculation from Wikipedia. If
> # the above uses fsum then a fair comparison should
> # use some compensated summation here also. However
> # it's not clear (to me) how to incorporate compensated
> # summation here.
> # http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Compensated_variant
> def variance_incremental(data):
>     n = 0
>     mean = 0
>     M2 = 0
>
>     for x in data:
>         n = n + 1
>         delta = x - mean
>         mean = mean + delta/n
>         M2 = M2 + delta*(x - mean)
>
>     variance = M2/(n - 1)
>     return variance
>
> # This is to me the obvious formula since I think of
> # this as the definition of the variance.
> def variance_twopass(data):
>     n = len(data)
>     mean = fsum(data) / n
>     sumdev2 = fsum((x - mean)**2 for x in data)
>     variance = sumdev2 / (n - 1)
>     return variance
>
>
> # This is the three-pass algorithm used in Steven's
> # statistics module. It's not one I had seen before but
> # AFAICT it's very accurate. In fact the 2nd and 3rd passes
> # can be merged as in variance_incremental but then we
> # wouldn't be able to take advantage of fsum.
> def variance_threepass(data):
>     n = len(data)
>     mean = fsum(data) / n
>     sumdev2 = fsum((x-mean)**2 for x in data)
>     # The following sum should mathematically equal zero, but due to
>     # rounding error may not.
>     sumdev2 -= fsum((x-mean) for x in data)**2 / n
>     return sumdev2 / (n - 1)
>
> methods = [
>     ('naive', variance_naive),
>     ('incremental', variance_incremental),
>     ('two_pass', variance_twopass),
>     ('three_pass', variance_threepass),
> ]
>
> # Test numbers with large mean and small standard deviation.
> # This is the case that causes trouble for the naive formula.
> N = 100000
> testnums = [gauss(mu=10**10, sigma=1) for n in range(N)]
>
> # First compute the exact result
> exact = variance_incremental([Fraction(num) for num in testnums])
> print('exact:', float(exact))
>
> # Compare each with the exact result
> for name, var in methods:
>     print(name, '-> error =', var(testnums) - exact)
--
Terry Jan Reedy