[Python-ideas] Proposal: add a calculator statistics module
Massimo Di Pierro
massimo.dipierro at gmail.com
Tue Sep 13 03:14:47 CEST 2011
You only need:
def E(f,data): return sum(f(x) for x in data)/len(data)
Then you can compute ANY expectation value:
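data = range(0,10)
average = mu = E(lambda x:x, data)  # mu is the mean used below
variance = E(lambda x:(x-mu)**2,data)
# skewness divides the third central moment by sigma**3 = variance**(3/2)
skewness = E(lambda x:(x-mu)**3,data)/variance**(3.0/2.0)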
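from random import random
N = 1000  # sample size; any positive integer works, 1000 is an arbitrary choice
X = [random() for i in range(N)]
Y = [random() for i in range(N)]
XY = [(X[i],Y[i]) for i in range(N)]
# E passes each (x, y) pair to f as a single tuple argument p
covariance = E(lambda p: p[0]*p[1], XY) - E(lambda p: p[0], XY)*E(lambda p: p[1], XY)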
etc. etc.
Hope it makes sense.
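As a rough check of the idea, here is a minimal sketch that runs the E-based variance on the data set from the quoted message below, before and after shifting it by 1e12. Computing the mean first and then averaging the squared deviations amounts to a two-pass calculation. The name two_pass_variance is made up for this illustration, and the code assumes Python 3 division and a non-empty data set.

```python
def E(f, data):
    """Mean of f(x) over data: the one-line helper from this message."""
    data = list(data)  # allow any iterable, including generators
    return sum(f(x) for x in data) / len(data)


def two_pass_variance(data):
    """Population variance: first pass finds the mean, second pass
    averages the squared deviations from it."""
    data = list(data)
    mu = E(lambda x: x, data)
    return E(lambda x: (x - mu) ** 2, data)


data = [1, 2, 4, 5, 8]
shifted = [x + 1e12 for x in data]  # same shift as in the quoted message

print(two_pass_variance(data))
print(two_pass_variance(shifted))
```

On this particular data set both calls give 6.0, because the mean and every deviation happen to be exactly representable as floats. That is not a guarantee in general; the two-pass form can still lose accuracy when the computed mean is itself rounded.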
Massimo
On Sep 12, 2011, at 8:00 PM, Steven D'Aprano wrote:
> I propose adding a basic calculator statistics module to the standard library, similar to the sorts of functions you would get on a scientific calculator:
>
> mean (average)
> variance (population and sample)
> standard deviation (population and sample)
> correlation coefficient
>
> and similar. I am volunteering to provide, and support, this module, written in pure Python so other implementations will be able to use it.
>
> Simple calculator-style statistics seem to me to be a fairly obvious "battery" to be included, more useful in practice than some functions already available such as factorial and the hyperbolic functions.
>
> The lack of a standard solution leads people who need basic stats to roll their own. This seems seductively simple, as the basic stats formulae are quite simple. Unfortunately doing it *correctly* is much harder than it seems. Variance, in particular, is prone to serious inaccuracies. Here is the most obvious algorithm, using the so-called "computational formula for the variance":
>
>
> def variance(data):
>     # σ² = 1/n**2 * (n*Σ(x**2) - (Σx)**2)
>     n = len(data)
>     s1 = sum(x**2 for x in data)
>     s2 = sum(data)
>     return (n*s1 - s2**2)/(n*n)
>
> Many stats textbooks recommend this as the best way to calculate variance, advice which makes sense when you're talking about hand calculations on a small number of moderate-sized data points, but not for floating point. It appears to work:
>
> >>> data = [1, 2, 4, 5, 8]
> >>> variance(data) # exact value = 6
> 6.0
>
> but unfortunately it is numerically unstable. Shifting all the data points by a constant amount shouldn't change the variance, but it does:
>
> >>> data = [x+1e12 for x in data]
> >>> variance(data)
> 171798691.84
>
> Even worse, variance should never be negative:
>
> >>> variance(data*100)
> -1266637395.197952
>
> Note that using math.fsum instead of the built-in sum does not fix the numeric instability problem, and it adds the additional problem that it coerces the data points to float. (If you use Decimal, this may not be what you want.)
>
> Here is an example of published code which suffers from exactly this problem:
>
> https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
>
> and here is an example on StackOverflow. Note the most popular answer given is to use the Computational Formula, which is the wrong answer.
>
> http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
>
> I would like to add a module to the standard library to solve these sorts of simple stats problems the right way, once and for all.
>
> Thoughts, comments, objections or words of encouragement are welcome.
>
>
>
> --
> Steven
>
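For reference, a minimal sketch of the failure described in the quoted message, set next to one well-known alternative, Welford's one-pass update. Nothing here is taken from the proposed module: the names naive_variance, naive_variance_fsum and welford_variance are made up for illustration, all three return the population variance, and all assume a non-empty data set and Python 3 division.

```python
import math
from decimal import Decimal


def naive_variance(data):
    # "Computational formula" from the quoted message.
    n = len(data)
    s1 = sum(x**2 for x in data)
    s2 = sum(data)
    return (n*s1 - s2**2) / (n*n)


def naive_variance_fsum(data):
    # Same formula with math.fsum. The sums are more accurate, but the
    # squares are already rounded and n*s1 - s2**2 still cancels badly.
    n = len(data)
    s1 = math.fsum(x**2 for x in data)
    s2 = math.fsum(data)
    return (n*s1 - s2**2) / (n*n)


def welford_variance(data):
    # Welford's one-pass update: keep a running mean and a running sum of
    # squared deviations from it, so two huge sums are never subtracted.
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n


data = [1, 2, 4, 5, 8]
shifted = [x + 1e12 for x in data]

print(naive_variance(data))           # small integers: no rounding involved
print(naive_variance(shifted))        # far from the true value of 6
print(naive_variance_fsum(shifted))   # fsum does not rescue the formula
print(welford_variance(shifted))      # close to 6

# math.fsum converts its inputs to float, so Decimal data comes back as float.
print(type(sum([Decimal("0.1")] * 3)))
print(type(math.fsum([Decimal("0.1")] * 3)))
```

Welford's update is only one of several ways to avoid the cancellation; it is shown here because it keeps the single pass that makes the computational formula attractive in the first place.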