Proposal: add a calculator statistics module

I propose adding a basic calculator statistics module to the standard library, similar to the sorts of functions you would get on a scientific calculator: mean (average) variance (population and sample) standard deviation (population and sample) correlation coefficient and similar. I am volunteering to provide, and support, this module, written in pure Python so other implementations will be able to use it. Simple calculator-style statistics seem to me to be a fairly obvious "battery" to be included, more useful in practice than some functions already available such as factorial and the hyperbolic functions. The lack of a standard solution leads people who need basic stats to roll their own. This seems seductively simple, as the basic stats formulae are quite simple. Unfortunately doing it *correctly* is much harder than it seems. Variance, in particular, is prone to serious inaccuracies. Here is the most obvious algorithm, using the so-called "computational formula for the variance": def variance(data): # σ2 = 1/n**2 * (n*Σ(x**2) - (Σx)**2) n = len(data) s1 = sum(x**2 for x in data) s2 = sum(data) return (n*s1 - s2**2)/(n*n) Many stats text books recommend this as the best way to calculate variance, advice which makes sense when you're talking about hand calculations of small numbers of moderate sized data, but not for floating point. It appears to work:
but unfortunately it is numerically unstable. Shifting all the data points by a constant amount shouldn't change the variance, but it does:
Even worse, variance should never be negative:
variance(data*100) -1266637395.197952
Note that using math.fsum instead of the built-in sum does not fix the numeric instability problem, and it adds the additional problem that it coerces the data points to float. (If you use Decimal, this may not be what you want.) Here is an example of published code which suffers from exactly this problem: https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py and here is an example on StackOverflow. Note the most popular answer given is to use the Computational Formula, which is the wrong answer. http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-... I would like to add a module to the standard library to solve these sorts of simple stats problems the right way, once and for all. Thoughts, comments, objections or words of encouragement are welcome. -- Steven

You only need: def E(f,data): return sum(f(x) for x in data)/len(data) Than you can compute ANY expectation value data = range(0,10) average = E(lambda x:x, data) variance = E(lambda x:(x-mu)**2,data) skewness = E(lambda x:(x-mu)**3,data)/variance**(2.0/3.0) X = [random() for i in range(N)] Y = [random() for i in range(N)] XY = [(X[i],Y[i]) for i in range(N)] covariance = E[lambda x,y: x*y, XY] - E(lambda x,y:x, XY)*E(lambda x,y:y, XY) Hope it makes sense. etc.etc. Massimo On Sep 12, 2011, at 8:00 PM, Steven D'Aprano wrote:

I like the idea of including batteries for statistics. Not to leave without shedding a bike - I know the "math" module says it "provides access to the mathematical functions defined by the C standard", but I think it's the perfect home for such functionality. http://docs.python.org/library/math.html begs for another header named "statistical functions". --Yuval Greenfield

Steven D'Aprano writes:
+1 for these.
and similar.
I'm not sure about that. I immediately thought "how about bivariate linear regression?", but that's actually misleading in many cases. Doing anything more general is beyond the scope of "like a typical scientific/statistical hand calculator". On the other hand, well-commented code showing how to do the bivariate case would help people to implement more general cases. Ditto for other common tasks like chi-square tests.

Technically linear regression can be implemented in one line of code assuming a good linear algebra package: http://code.google.com/p/fin2py/source/browse/numeric.py#791 (here used for fitting polynomials or any other bivariate or multivariate function linear in coefficients) Unless python is going to have a built-in linear algebra packages, I think linear regressions fits better in scipy. Massimo On Sep 12, 2011, at 9:35 PM, Stephen J. Turnbull wrote:

On Tue, Sep 13, 2011 at 11:00 AM, Steven D'Aprano <steve@pearwood.info> wrote:
And since some folks may not have seen it, Steven's proposal here is following up on a suggestion Raymond Hettinger posted to this last year: http://mail.python.org/pipermail/python-ideas/2010-October/008267.html
From my point of view, I'd make the following suggestions:
1. We should start very small (similar to the way itertools grew over time) To me that means: mean, median, mode variance standard deviation Anything beyond that (including coroutine-style running calculations) is probably better left until 3.4. In the specific case of running calculations, this is to give us a chance to see how coroutine APIs are best written in a world where generators can return values as well as yielding them. Any APIs that would benefit from having access to running variants (such as being able to collect multiple statistics in a single pass) should also be postponed. Some more advanced algorithms could be included as recipes in the initial docs. The docs should also include pointers to more full-featured stats modules for reference when users needs outgrow the included batteries. 2. The 'math' module is not the place for this, a new, dedicated module is more appropriate. This is mainly due to the fact that the math module is focused primarily on binary floating point, while these algorithms should be neutral with regard to the specific numeric type involved. However, the practical issues with math being a builtin module are also a factor. There are many colours the naming bikeshed could be painted, but I'd be inclined to just call it 'statistics' ('statstools' is unwieldy, and other variants like 'stats', 'simplestats', 'statlib' and 'stats-tools' all exist on PyPI). Since the opportunity to just use the full word is there, we may as well take it. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 13 September 2011 05:06, Nick Coghlan <ncoghlan@gmail.com> wrote:
+1 (both on the Steven's original suggestion, and Nick's follow-up comment). I like the suggestion of having a running calculation version, but agree that it's probably a bit soon to decide on the best API for such things. Recipes in the documentation would be a good start, though. One place I'd disagree with Nick, though, I'd like to see correlation coefficient and linear regression in there. They are common on calculators, and I do tend to use them reasonably often. Please save me from starting Excel to calculate them! :-) Paul.

On Tue, Sep 13, 2011 at 7:23 PM, Paul Moore <p.f.moore@gmail.com> wrote:
I'm not strongly against those two, I mainly wanted to emphasise the "start small and grow" aspect to try to keep the initial version focused and maintainable. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, Sep 13, 2011 at 12:23 PM, Paul Moore <p.f.moore@gmail.com> wrote:
In the past few months I've done some work on "running calculations" in Python, and came up with a module I call RunningCalcs: http://pypi.python.org/pypi/RunningCalcs/ http://bitbucket.org/taleinat/runningcalcs/ It includes comprehensive tests and some benchmarks (in the wiki at BitBucket). If "running calculations" are to be considered for inclusion in the stdlib, I propose RunningCalcs as an example implementation. Note that implementing calculations in this manner makes performing several calculations on a single iterable very easy and potentially efficient. RunningCalcs includes implementations of a few calculations, including mean, variance and stdandard deviation, min & max, several summation algorithms and n-largest & n-smallest. Implementing a RunningCalc is simple and straight-forward. Usage is as follows: # feeding inputs directly to the RunningCalc instances, one input at a time mean_rc, stddev_rc = RunningMean(), RunningStdDev() for x in inputs: mean_rc.feed(x) stddev_rc.feed(x) mean, stddev = mean_rc.value, stddev_rc.value # easy & fast calculation using apply_in_parallel() a_i_p = apply_in_parallel mean, stddev = a_i_p(inputs, [RunningMean(), RunningStdDev()]) small5, large5 = a_i_p(inputs, [RunningNSmallest(5), RunningNLargest(5)]) Regarding co-routines: During development I considered using co-routine-generators; my implementation of Kahan summation still uses such a generator. I've found this isn't a good generic method for implementing "running calculations", mainly because such a generator must return the current value at each iteration, even though this value is usually not needed nearly so often. For example, implementing a running version of n-largest using a co-routine/generator would introduce a large overhead, whereas my version is as fast as _heapq.nlargest (which is implemented in C -- see benchmarks for details). - Tal Einat

On Tue, Sep 13, 2011 at 12:06 AM, Nick Coghlan <ncoghlan@gmail.com> wrote: ..
While I agree that a new dedicated module is an appropriate place for this functionality, I am concerned that generic type-neutral algorithms may be sub-optimal for float-only calculations. I think at least at the start, the new module should follow the same design as the math module - namely provide numerically stable efficient algorithms for double precision floating point data. Other types should be supported by coercing to float. This means that support for long integers, decimal type, fractions, complex numbers will necessarily be limited. For example, you will not be able to compute average of 100-digit integers accurately even when the exact answer is an integer. Once the semantics of operations on general iterables yielding numbers are in place, the first optimization should probably be to special case operations on arrays of doubles. Proper support for decimal type, fractions, etc would probably need a separate project.

You only need: def E(f,data): return sum(f(x) for x in data)/len(data) Than you can compute ANY expectation value data = range(0,10) average = E(lambda x:x, data) variance = E(lambda x:(x-mu)**2,data) skewness = E(lambda x:(x-mu)**3,data)/variance**(2.0/3.0) X = [random() for i in range(N)] Y = [random() for i in range(N)] XY = [(X[i],Y[i]) for i in range(N)] covariance = E[lambda x,y: x*y, XY] - E(lambda x,y:x, XY)*E(lambda x,y:y, XY) Hope it makes sense. etc.etc. Massimo On Sep 12, 2011, at 8:00 PM, Steven D'Aprano wrote:

I like the idea of including batteries for statistics. Not to leave without shedding a bike - I know the "math" module says it "provides access to the mathematical functions defined by the C standard", but I think it's the perfect home for such functionality. http://docs.python.org/library/math.html begs for another header named "statistical functions". --Yuval Greenfield

Steven D'Aprano writes:
+1 for these.
and similar.
I'm not sure about that. I immediately thought "how about bivariate linear regression?", but that's actually misleading in many cases. Doing anything more general is beyond the scope of "like a typical scientific/statistical hand calculator". On the other hand, well-commented code showing how to do the bivariate case would help people to implement more general cases. Ditto for other common tasks like chi-square tests.

Technically linear regression can be implemented in one line of code assuming a good linear algebra package: http://code.google.com/p/fin2py/source/browse/numeric.py#791 (here used for fitting polynomials or any other bivariate or multivariate function linear in coefficients) Unless python is going to have a built-in linear algebra packages, I think linear regressions fits better in scipy. Massimo On Sep 12, 2011, at 9:35 PM, Stephen J. Turnbull wrote:

On Tue, Sep 13, 2011 at 11:00 AM, Steven D'Aprano <steve@pearwood.info> wrote:
And since some folks may not have seen it, Steven's proposal here is following up on a suggestion Raymond Hettinger posted to this last year: http://mail.python.org/pipermail/python-ideas/2010-October/008267.html
From my point of view, I'd make the following suggestions:
1. We should start very small (similar to the way itertools grew over time) To me that means: mean, median, mode variance standard deviation Anything beyond that (including coroutine-style running calculations) is probably better left until 3.4. In the specific case of running calculations, this is to give us a chance to see how coroutine APIs are best written in a world where generators can return values as well as yielding them. Any APIs that would benefit from having access to running variants (such as being able to collect multiple statistics in a single pass) should also be postponed. Some more advanced algorithms could be included as recipes in the initial docs. The docs should also include pointers to more full-featured stats modules for reference when users needs outgrow the included batteries. 2. The 'math' module is not the place for this, a new, dedicated module is more appropriate. This is mainly due to the fact that the math module is focused primarily on binary floating point, while these algorithms should be neutral with regard to the specific numeric type involved. However, the practical issues with math being a builtin module are also a factor. There are many colours the naming bikeshed could be painted, but I'd be inclined to just call it 'statistics' ('statstools' is unwieldy, and other variants like 'stats', 'simplestats', 'statlib' and 'stats-tools' all exist on PyPI). Since the opportunity to just use the full word is there, we may as well take it. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 13 September 2011 05:06, Nick Coghlan <ncoghlan@gmail.com> wrote:
+1 (both on the Steven's original suggestion, and Nick's follow-up comment). I like the suggestion of having a running calculation version, but agree that it's probably a bit soon to decide on the best API for such things. Recipes in the documentation would be a good start, though. One place I'd disagree with Nick, though, I'd like to see correlation coefficient and linear regression in there. They are common on calculators, and I do tend to use them reasonably often. Please save me from starting Excel to calculate them! :-) Paul.

On Tue, Sep 13, 2011 at 7:23 PM, Paul Moore <p.f.moore@gmail.com> wrote:
I'm not strongly against those two, I mainly wanted to emphasise the "start small and grow" aspect to try to keep the initial version focused and maintainable. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, Sep 13, 2011 at 12:23 PM, Paul Moore <p.f.moore@gmail.com> wrote:
In the past few months I've done some work on "running calculations" in Python, and came up with a module I call RunningCalcs: http://pypi.python.org/pypi/RunningCalcs/ http://bitbucket.org/taleinat/runningcalcs/ It includes comprehensive tests and some benchmarks (in the wiki at BitBucket). If "running calculations" are to be considered for inclusion in the stdlib, I propose RunningCalcs as an example implementation. Note that implementing calculations in this manner makes performing several calculations on a single iterable very easy and potentially efficient. RunningCalcs includes implementations of a few calculations, including mean, variance and stdandard deviation, min & max, several summation algorithms and n-largest & n-smallest. Implementing a RunningCalc is simple and straight-forward. Usage is as follows: # feeding inputs directly to the RunningCalc instances, one input at a time mean_rc, stddev_rc = RunningMean(), RunningStdDev() for x in inputs: mean_rc.feed(x) stddev_rc.feed(x) mean, stddev = mean_rc.value, stddev_rc.value # easy & fast calculation using apply_in_parallel() a_i_p = apply_in_parallel mean, stddev = a_i_p(inputs, [RunningMean(), RunningStdDev()]) small5, large5 = a_i_p(inputs, [RunningNSmallest(5), RunningNLargest(5)]) Regarding co-routines: During development I considered using co-routine-generators; my implementation of Kahan summation still uses such a generator. I've found this isn't a good generic method for implementing "running calculations", mainly because such a generator must return the current value at each iteration, even though this value is usually not needed nearly so often. For example, implementing a running version of n-largest using a co-routine/generator would introduce a large overhead, whereas my version is as fast as _heapq.nlargest (which is implemented in C -- see benchmarks for details). - Tal Einat

On Tue, Sep 13, 2011 at 12:06 AM, Nick Coghlan <ncoghlan@gmail.com> wrote: ..
While I agree that a new dedicated module is an appropriate place for this functionality, I am concerned that generic type-neutral algorithms may be sub-optimal for float-only calculations. I think at least at the start, the new module should follow the same design as the math module - namely provide numerically stable efficient algorithms for double precision floating point data. Other types should be supported by coercing to float. This means that support for long integers, decimal type, fractions, complex numbers will necessarily be limited. For example, you will not be able to compute average of 100-digit integers accurately even when the exact answer is an integer. Once the semantics of operations on general iterables yielding numbers are in place, the first optimization should probably be to special case operations on arrays of doubles. Proper support for decimal type, fractions, etc would probably need a separate project.
participants (10)
-
Alexander Belopolsky
-
Eli Bendersky
-
Ethan Furman
-
Massimo Di Pierro
-
Nick Coghlan
-
Paul Moore
-
Stephen J. Turnbull
-
Steven D'Aprano
-
Tal Einat
-
Yuval Greenfield