[Python-ideas] Proposal: add a calculator statistics module

Tue Sep 13 03:00:35 CEST 2011

I propose adding a basic calculator statistics module to the standard 
library, similar to the sorts of functions you would get on a scientific 
calculator:

mean (average)
variance (population and sample)
standard deviation (population and sample)
correlation coefficient

and similar. I am volunteering to provide, and support, this module, 
written in pure Python so other implementations will be able to use it.

Simple calculator-style statistics seem to me to be a fairly obvious 
"battery" to be included, more useful in practice than some functions 
already available such as factorial and the hyperbolic functions.

The lack of a standard solution leads people who need basic stats to 
roll their own. This seems seductively simple, as the basic stats 
formulae are quite simple. Unfortunately doing it *correctly* is much 
harder than it seems. Variance, in particular, is prone to serious 
inaccuracies. Here is the most obvious algorithm, using the so-called 
"computational formula for the variance":

def variance(data):
     #   σ2 = 1/n**2 * (n*Σ(x**2) - (Σx)**2)
     n = len(data)
     s1 = sum(x**2 for x in data)
     s2 = sum(data)
     return (n*s1 - s2**2)/(n*n)

Many stats text books recommend this as the best way to calculate 
variance, advice which makes sense when you're talking about hand 
calculations of small numbers of moderate sized data, but not for 
floating point. It appears to work:

 >>> data = [1, 2, 4, 5, 8]
 >>> variance(data)  # exact value = 6
6.0

but unfortunately it is numerically unstable. Shifting all the data 
points by a constant amount shouldn't change the variance, but it does:

 >>> data = [x+1e12 for x in data]
 >>> variance(data)
171798691.84

Even worse, variance should never be negative:

 >>> variance(data*100)
-1266637395.197952

Note that using math.fsum instead of the built-in sum does not fix the 
numeric instability problem, and it adds the additional problem that it 
coerces the data points to float. (If you use Decimal, this may not be 
what you want.)

Here is an example of published code which suffers from exactly this 
problem:

https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py

and here is an example on StackOverflow. Note the most popular answer 
given is to use the Computational Formula, which is the wrong answer.

http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration

I would like to add a module to the standard library to solve these 
sorts of simple stats problems the right way, once and for all.

Thoughts, comments, objections or words of encouragement are welcome.

-- 
Steven