[Python-ideas] Proposal: add a calculator statistics module
Steven D'Aprano
steve at pearwood.info
Tue Sep 13 03:00:35 CEST 2011
I propose adding a basic calculator statistics module to the standard
library, similar to the sorts of functions you would get on a scientific
calculator:
mean (average)
variance (population and sample)
standard deviation (population and sample)
correlation coefficient
and similar. I am volunteering to provide, and support, this module,
written in pure Python so other implementations will be able to use it.
Simple calculator-style statistics seem to me to be a fairly obvious
"battery" to be included, more useful in practice than some functions
already available such as factorial and the hyperbolic functions.
The lack of a standard solution leads people who need basic stats to
roll their own. This seems seductively simple, as the basic stats
formulae are quite simple. Unfortunately doing it *correctly* is much
harder than it seems. Variance, in particular, is prone to serious
inaccuracies. Here is the most obvious algorithm, using the so-called
"computational formula for the variance":
def variance(data):
# σ2 = 1/n**2 * (n*Σ(x**2) - (Σx)**2)
n = len(data)
s1 = sum(x**2 for x in data)
s2 = sum(data)
return (n*s1 - s2**2)/(n*n)
Many stats text books recommend this as the best way to calculate
variance, advice which makes sense when you're talking about hand
calculations of small numbers of moderate sized data, but not for
floating point. It appears to work:
>>> data = [1, 2, 4, 5, 8]
>>> variance(data) # exact value = 6
6.0
but unfortunately it is numerically unstable. Shifting all the data
points by a constant amount shouldn't change the variance, but it does:
>>> data = [x+1e12 for x in data]
>>> variance(data)
171798691.84
Even worse, variance should never be negative:
>>> variance(data*100)
-1266637395.197952
Note that using math.fsum instead of the built-in sum does not fix the
numeric instability problem, and it adds the additional problem that it
coerces the data points to float. (If you use Decimal, this may not be
what you want.)
Here is an example of published code which suffers from exactly this
problem:
https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
and here is an example on StackOverflow. Note the most popular answer
given is to use the Computational Formula, which is the wrong answer.
http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
I would like to add a module to the standard library to solve these
sorts of simple stats problems the right way, once and for all.
Thoughts, comments, objections or words of encouragement are welcome.
--
Steven
More information about the Python-ideas
mailing list