[Python-ideas] Pre-PEP: adding a statistics module to Python

Oscar Benjamin oscar.j.benjamin at gmail.com
Thu Aug 8 13:38:01 CEST 2013


On 8 August 2013 01:19, Steven D'Aprano <steve at pearwood.info> wrote:
> On 08/08/13 08:24, Ethan Furman wrote:
>>
>> On 08/07/2013 03:20 PM, David Mertz wrote:
>>>
>>> Here's a question for the actual statisticians on the list (I'm not close
>>> to this).  Would having a look-ahead window of
>>> moderate size (probably configurable) do enough good in numeric accuracy
>>> to be worthwhile?  Obviously, creating
>>> pathological cases is still possible, but in the "normal" situation, does
>>> this matter enough?  I.e. if the function were
>>> to read 100 numbers from an iterator, perform some manipulation on their
>>> ordering or scaling, produce that better
>>> intermediate result, then do the same with the next chunk of 100 numbers,
>>> is this enough of a win to have as an option?
>>
>>
>> I have a follow-up question:  considering the built-in error when
>> calculating statistics, is the difference between sequence and iterator
>> significant?  Would we be just as well served with `it = iter(sequence)` and
>> always using the one-pass algorithm?
>
> To the best of my knowledge, there is no widely-known algorithm for
> accurately calculating variance in chunks. There's a two-pass algorithm, and
> a one-pass algorithm, and that's what people use. My guess is that you could
> adapt the one-pass algorithm to look-ahead in chunks, but to my mind that
> adds complexity for no benefit.

It's true that chunked algorithms are rarely used for single-pass
computation. However, there are algorithms for working with chunks
that are used for distributed/parallel computation. A quick test on my
machine indicates that chunking gives an accuracy somewhere between
that of the 1-pass and 2-pass algorithms. Depending on the API for
1-pass statistics, the data may naturally arrive in chunks, or there
may be other reasons to chunk it anyway, in which case these
algorithms could be worth using. They would also be useful if there's
ever an API for merging stats computed in different places.

The method below is apparently due to Chan et al., but I'd be lying if
I said I consulted anything apart from Wikipedia, so see here for more
info:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
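
For concreteness, here's a minimal sketch (the name merge_stats is
mine, purely for illustration) of how two partial summaries (count,
mean and sum of squared deviations) can be combined with that formula;
the pvarchunks function at the bottom of this message performs the
same update chunk by chunk:

def merge_stats(count_a, mean_a, ssqdev_a, count_b, mean_b, ssqdev_b):
    '''Combine two partial summaries using the Chan et al. formula.

    Each summary is (count, mean, sum of squared deviations about the
    mean). Assumes at least one of the counts is nonzero.
    '''
    count = count_a + count_b
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    # The cross term accounts for the two groups having different means.
    ssqdev = ssqdev_a + ssqdev_b + delta**2 * count_a * count_b / count
    return count, mean, ssqdev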

> Computationally, there is a small, but real, difference in the two
> algorithms, as can be seen by the fact that they give different results.
> Does the difference make a difference *in practice*, given that most
> real-world measurements are only accurate to 2-4 decimal places? No, not
> really. If your data is accurate to 3 decimal places, you've got no business
> quoting the variance to 15 decimal places.
>
> But, people will compare the std lib variance to numpy's variance, to
> Excel's variance, to their calculator's variance, and when there is a
> discrepancy, I would rather it be in our favour rather than have to make the
> excuse "well you see we picked the slightly less accurate algorithm in order
> to prematurely optimize for enormous data sets too big to fit into memory".
>
> The two-pass algorithm stays. I'm deferring to 3.5 a set of one-pass
> iterator friendly functions that will be suitable for calculating multiple
> statistics from a single data stream without building a list.

You have the right approach here. I don't think numpy or any of the
statistical libraries you listed in the PEP provides 1-pass
statistics. AFAIK they all use the 2-pass algorithm to compute
variance. The difference may be small, but the 2-pass method is
definitely more accurate and usually more efficient (an additional
pass over a list costs very little compared with the actual
computation).
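
To make the distinction concrete, here's a rough sketch of the two
approaches being compared (the names are mine, not Steven's; neither
function is taken from his module):

def pvariance_2pass(data):
    '''Textbook two-pass population variance of a non-empty sequence.'''
    n = len(data)
    mean = sum(data) / n                          # first pass: the mean
    return sum((x - mean)**2 for x in data) / n   # second pass: deviations

def pvariance_1pass(iterable):
    '''Welford-style one-pass population variance of a non-empty iterable.'''
    count, mean, m2 = 0, 0.0, 0.0
    for x in iterable:
        count += 1
        delta = x - mean
        mean += delta / count       # running mean
        m2 += delta * (x - mean)    # running sum of squared deviations
    return m2 / count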

The code is at the bottom of this message; here's some example output
(it uses random data, so the exact numbers aren't repeatable):

$ python3 stats.py 10000
Data points: 10000
Exact result: 250000020133.92606
pvariance: 250000020133.92606
    error: 0.0e+00
pvar1pass: 250000020133.92624
    error: -1.83e-04
pvarchunk: 250000020133.92606
    error: 0.00e+00
Data points: 2000000
pvariance: 249999999484.0627
pvar1pass: 249999999484.0557
    error: 7.02e-03
pvarchunk: 249999999484.0625
    error: 2.14e-04

So Steven's pvariance function can often compute the variance to
within machine precision of the true value (as accurate as it possibly
could be). The 1-pass variance has a relative error on the order of
machine precision (1.83e-04 out of 2.5e11 is roughly 7e-16, against a
double-precision epsilon of about 2.2e-16), which is plenty accurate.
Chunking with chunksize 100 gives an error that's about 10 times
smaller (presumably a sqrt(N) effect: with chunks of 100 you'd expect
roughly a factor of sqrt(100) = 10).

The code:

#!/usr/bin/env python

from itertools import takewhile, islice, repeat

import statistics


def chunks(iterable, chunksize=100):
    '''Break iterable into chunks.

    >>> for chunk in chunks(range(8), 3):
    ...     print(chunk)
    [0, 1, 2]
    [3, 4, 5]
    [6, 7]
    '''
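    # Repeatedly islice() the same underlying iterator; map(list, ...)
    # materialises each slice and takewhile(bool, ...) stops at the
    # first empty chunk.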
    islices = map(islice, repeat(iter(iterable)), repeat(chunksize))
    return takewhile(bool, map(list, islices))


def pvarchunks(iterable, chunksize=100):
    '''Estimate variance in 1-pass by chunkifying iterable.

    >>> # Use an exact computation for demonstration
    >>> from fractions import Fraction
    >>> dice = [Fraction(n) for n in range(1, 6+1)]
    >>> print(pvarchunks(dice, chunksize=4))
    35/12

    >>> # For comparison
    >>> import statistics
    >>> print(statistics.pvariance(dice))
    35/12

    '''
    count = 0   # number of items seen
    xbar = 0    # current estimate of mean
    ssqdev = 0  # running sum of squared deviations about xbar
    for chunk in chunks(iterable, chunksize):
        counti = len(chunk)
        xbari = statistics.mean(chunk)
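        # statistics._direct() is a private helper in the draft module;
        # it returns the sum of squared deviations of chunk about xbari.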
        ssqdevi = statistics._direct(chunk, xbari, counti)
        # Merge (counti, xbari, ssqdevi) into the running totals
        # using the Chan et al. combine step.
        delta = (((xbari - xbar)**2) * (count*counti)) / (count + counti)
        xbar = (count*xbar + counti*xbari) / (count + counti)
        ssqdev = ssqdev + ssqdevi + delta
        count = count + counti
    # Compute the final result
    return ssqdev / count


if __name__ == "__main__":
    import doctest
    doctest.testmod()

    import sys
    from random import gauss, shuffle
    from fractions import Fraction

    # Test numerical accuracy and compare with other methods
    # N is the size of the test.
    N1 = int(sys.argv[1]) if sys.argv[1:] else 200
    N2 = int(sys.argv[2]) if sys.argv[2:] else 1000000

    # Try to be numerically awkward
    data = [gauss(0, 1) for n in range(N1 // 2)]
    data += [gauss(1000000, 1) for n in range(N1 // 2)]
    shuffle(data)
    print('Data points:', len(data))

    # First we'll get an exact result using fractions
    # This is really slow...
    fnums = [Fraction(x) for x in data]
    exact_var = statistics.pvariance(fnums)
    print('Exact result:', float(exact_var))

    # compare with pvariance on floats
    pvarfloat = statistics.pvariance(data)
    print('pvariance:', pvarfloat)
    print('    error: %.1e' % (exact_var - pvarfloat))

    # compare with 1-pass algorithm
    pvar1pass = statistics.pvariance(iter(data))
    print('pvar1pass:', pvar1pass)
    print('    error: %.2e' % (exact_var - pvar1pass))

    # compare with chunked 1-pass algorithm
    pvarchunk = pvarchunks(data, chunksize=100)
    print('pvarchunk:', pvarchunk)
    print('    error: %.2e' % (exact_var - pvarchunk))

    # Okay now we'll try a really large amount of data
    # (and not bother with fractions)
    data = [gauss(0, 1) for n in range(N2)]
    data += [gauss(1000000, 1) for n in range(N2)]
    shuffle(data)
    print('Data points:', len(data))

    # pvariance on floats (use as true answer in this case)
    pvarfloat = statistics.pvariance(data)
    print('pvariance:', pvarfloat)

    pvar1pass = statistics.pvariance(iter(data))
    print('pvar1pass:', pvar1pass)
    print('    error: %.2e' % (pvarfloat - pvar1pass))

    # Now compare with chunks
    pvarchunk = pvarchunks(data, chunksize=100)
    print('pvarchunk:', pvarchunk)
    print('    error: %.2e' % (pvarfloat - pvarchunk))


Oscar

