
On 13 September 2011 05:06, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Tue, Sep 13, 2011 at 11:00 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I propose adding a basic calculator statistics module to the standard library, similar to the sorts of functions you would get on a scientific calculator:
* mean (average)
* variance (population and sample)
* standard deviation (population and sample)
* correlation coefficient
and similar. I am volunteering to provide, and support, this module, written in pure Python so other implementations will be able to use it.
Simple calculator-style statistics seem to me to be a fairly obvious "battery" to be included, more useful in practice than some functions already available such as factorial and the hyperbolic functions.
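For concreteness, here is a minimal pure-Python sketch of the calculator-style functions listed above. The function names and signatures are illustrative assumptions, not part of the proposal, and the naive two-pass approach is meant for readability rather than numerical robustness.

```python
import math


def mean(data):
    """Arithmetic mean of a non-empty sequence of numbers."""
    data = list(data)
    if not data:
        raise ValueError("mean requires at least one data point")
    return sum(data) / len(data)


def _sum_sq_dev(data):
    """Sum of squared deviations from the mean (two-pass)."""
    m = mean(data)
    return sum((x - m) ** 2 for x in data)


def pvariance(data):
    """Population variance: divide by n."""
    data = list(data)
    return _sum_sq_dev(data) / len(data)


def variance(data):
    """Sample variance: divide by n - 1."""
    data = list(data)
    if len(data) < 2:
        raise ValueError("sample variance requires at least two data points")
    return _sum_sq_dev(data) / (len(data) - 1)


def pstdev(data):
    """Population standard deviation."""
    return math.sqrt(pvariance(data))


def stdev(data):
    """Sample standard deviation."""
    return math.sqrt(variance(data))


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


data = [2, 4, 4, 4, 5, 5, 7, 9]
average = mean(data)      # 5.0
spread = pstdev(data)     # 2.0
```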
And since some folks may not have seen it, Steven's proposal here is following up on a suggestion Raymond Hettinger posted to this list last year:
http://mail.python.org/pipermail/python-ideas/2010-October/008267.html
From my point of view, I'd make the following suggestions:
1. We should start very small (similar to the way itertools grew over time)
To me that means:
* mean, median, mode
* variance
* standard deviation
Anything beyond that (including coroutine-style running calculations) is probably better left until 3.4. In the specific case of running calculations, this is to give us a chance to see how coroutine APIs are best written in a world where generators can return values as well as yielding them. Any APIs that would benefit from having access to running variants (such as being able to collect multiple statistics in a single pass) should also be postponed.
Some more advanced algorithms could be included as recipes in the initial docs. The docs should also include pointers to more full-featured stats modules for reference when users' needs outgrow the included batteries.
2. The 'math' module is not the place for this; a new, dedicated module is more appropriate. This is mainly because the math module is focused primarily on binary floating point, while these algorithms should be neutral with regard to the specific numeric type involved. However, the practical issues with math being a builtin module are also a factor.
There are many colours the naming bikeshed could be painted, but I'd be inclined to just call it 'statistics' ('statstools' is unwieldy, and other variants like 'stats', 'simplestats', 'statlib' and 'stats-tools' all exist on PyPI). Since the opportunity to just use the full word is there, we may as well take it.
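To make the two points above a little more concrete, here is one possible shape for a coroutine-style running calculation that is also neutral about the numeric type. It is only a sketch under several assumptions: the names running_stats and summarize are hypothetical, sending None is an arbitrary choice of end-of-data signal, and returning a value from a generator needs Python 3.3 or later. The update step is Welford's online algorithm, used here as a convenient stand-in rather than as anything the thread settled on.

```python
from fractions import Fraction


def running_stats():
    """Coroutine: send() values in, send(None) to finish.

    Yields (count, mean) after each value and finally returns
    (count, mean, sample_variance) as the generator's return value.
    """
    n = 0
    mean = 0
    m2 = 0  # running sum of squared deviations from the mean
    while True:
        x = yield (n, mean)
        if x is None:
            break
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else None
    return n, mean, variance


def summarize(data):
    """Drive running_stats() over an iterable in a single pass."""
    stats = running_stats()
    next(stats)  # prime the coroutine so it is waiting at the first yield
    for x in data:
        stats.send(x)
    try:
        stats.send(None)  # signal end of data
    except StopIteration as stop:
        return stop.value
    raise RuntimeError("running_stats did not finish")


# No float conversion happens, so exact types stay exact.
result = summarize([Fraction(1), Fraction(2), Fraction(3), Fraction(4)])
# result == (4, Fraction(5, 2), Fraction(5, 3))
```

Because the arithmetic uses only subtraction, multiplication and division by an integer count, the same code should accept float or Decimal input as well, which is the kind of type neutrality the math module does not offer.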
+1 (both on Steven's original suggestion and Nick's follow-up comment). I like the suggestion of having a running calculation version, but agree that it's probably a bit soon to decide on the best API for such things. Recipes in the documentation would be a good start, though.
One place I'd disagree with Nick, though: I'd like to see correlation coefficient and linear regression in there. They are common on calculators, and I do tend to use them reasonably often. Please save me from starting Excel to calculate them! :-)
Paul.
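As a rough illustration of the calculator-style linear regression mentioned above, an ordinary least-squares fit of a straight line takes only a few lines of pure Python. The function name linear_regression and its (slope, intercept) return value are assumptions made for this sketch, not an agreed API.

```python
def linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept) for two equal-length sequences.
    """
    xs, ys = list(xs), list(ys)
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least two data points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations in x, and of cross products of deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Points lying exactly on y = 2x + 1.
slope, intercept = linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
# slope == 2.0, intercept == 1.0
```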