Re: [Python-ideas] Proposal: add a calculator statistics module

Sept. 13, 2011

On 13 September 2011 05:06, Nick Coghlan <ncoghlan@gmail.com> wrote:
...
On Tue, Sep 13, 2011 at 11:00 AM, Steven D'Aprano <steve@pearwood.info> wrote:
...
I propose adding a basic calculator statistics module to the standard
library, similar to the sorts of functions you would get on a scientific
calculator:
 mean (average)
 variance (population and sample)
 standard deviation (population and sample)
 correlation coefficient
and similar. I am volunteering to provide, and support, this module, written
in pure Python so other implementations will be able to use it.
Simple calculator-style statistics seem to me to be a fairly obvious
"battery" to be included, more useful in practice than some functions
already available such as factorial and the hyperbolic functions.
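For illustration only, the calculator-style functions listed above could be sketched in pure Python roughly as follows. The function names (mean, pvariance, variance, pstdev, stdev) and their signatures are assumptions made for this example; they are not taken from the proposed module.

```python
import math

def mean(data):
    """Arithmetic mean of a non-empty collection of numbers."""
    data = list(data)
    if not data:
        raise ValueError("mean requires at least one data point")
    return sum(data) / len(data)

def _sum_sq_dev(data):
    """Return (n, sum of squared deviations from the mean)."""
    data = list(data)
    m = mean(data)
    return len(data), sum((x - m) ** 2 for x in data)

def pvariance(data):
    """Population variance: divide by n."""
    n, ss = _sum_sq_dev(data)
    return ss / n

def variance(data):
    """Sample variance: divide by n - 1."""
    n, ss = _sum_sq_dev(data)
    if n < 2:
        raise ValueError("sample variance requires at least two data points")
    return ss / (n - 1)

def pstdev(data):
    """Population standard deviation."""
    return math.sqrt(pvariance(data))

def stdev(data):
    """Sample standard deviation."""
    return math.sqrt(variance(data))
```

The population and sample variants differ only in the divisor (n versus n - 1), which is why a calculator usually offers both keys.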
And since some folks may not have seen it, Steven's proposal here is
following up on a suggestion Raymond Hettinger posted to this list last
year:
http://mail.python.org/pipermail/python-ideas/2010-October/008267.html
...
From my point of view, I'd make the following suggestions:
1. We should start very small (similar to the way itertools grew over time)
To me that means:
 mean, median, mode
 variance
 standard deviation
Anything beyond that (including coroutine-style running calculations)
is probably better left until 3.4. In the specific case of running
calculations, this is to give us a chance to see how coroutine APIs
are best written in a world where generators can return values as well
as yielding them. Any APIs that would benefit from having access to
running variants (such as being able to collect multiple statistics in
a single pass) should also be postponed.
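To make "coroutine-style running calculations" concrete, here is one possible shape for such an API, sketched with a generator that receives values through send(). It uses Welford's online algorithm to update the mean and sample variance in a single pass. The name running_stats and the (count, mean, variance) result tuple are hypothetical, not an API anyone in this thread has proposed.

```python
def running_stats():
    """Generator-based coroutine: send() a value, receive
    (count, mean, sample variance) for everything seen so far."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    result = None
    while True:
        x = yield result
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the updated mean (Welford's update)
        variance = m2 / (n - 1) if n > 1 else None
        result = (n, mean, variance)

stats = running_stats()
next(stats)  # prime the coroutine so it is waiting at the first yield
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    count, avg, var = stats.send(value)
```

Because several statistics come out of one pass over the data, this is the kind of design that depends on how coroutine APIs settle down, which is the reason given above for postponing it.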
Some more advanced algorithms could be included as recipes in the
initial docs. The docs should also include pointers to more
full-featured stats modules for reference when users' needs outgrow the
included batteries.
2. The 'math' module is not the place for this; a new, dedicated
module is more appropriate. This is mainly due to the fact that the
math module is focused primarily on binary floating point, while these
algorithms should be neutral with regard to the specific numeric type
involved. However, the practical issues with math being a builtin
module are also a factor.
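A small sketch of what "neutral with regard to the specific numeric type" could mean in practice: a mean built only on sum() and division preserves Fraction and Decimal inputs, while one built on math.fsum always collapses to a binary float. Both function names here (generic_mean, float_mean) are made up for the example.

```python
import math
from decimal import Decimal
from fractions import Fraction

def generic_mean(data):
    """Mean using only + and /, so the result keeps the input's numeric type."""
    data = list(data)
    if not data:
        raise ValueError("mean requires at least one data point")
    return sum(data) / len(data)

def float_mean(data):
    """Mean in the style of the math module: always returns a binary float."""
    data = list(data)
    if not data:
        raise ValueError("mean requires at least one data point")
    return math.fsum(data) / len(data)

exact = generic_mean([Fraction(1, 3), Fraction(2, 3)])                   # stays a Fraction
money = generic_mean([Decimal("0.1"), Decimal("0.2"), Decimal("0.3")])  # stays a Decimal
lossy = float_mean([Fraction(1, 3), Fraction(2, 3)])                     # becomes a float
```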
There are many colours the naming bikeshed could be painted, but I'd
be inclined to just call it 'statistics' ('statstools' is unwieldy,
and other variants like 'stats', 'simplestats', 'statlib' and
'stats-tools' all exist on PyPI). Since the opportunity to just use
the full word is there, we may as well take it.
+1 (both on Steven's original suggestion and Nick's follow-up comment).

I like the suggestion of having a running calculation version, but
agree that it's probably a bit soon to decide on the best API for such
things. Recipes in the documentation would be a good start, though.

One place I'd disagree with Nick, though: I'd like to see correlation
coefficient and linear regression in there. They are common on
calculators, and I do tend to use them reasonably often. Please save
me from starting Excel to calculate them! :-)
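For reference, the two calculations in question are small. A minimal pure-Python sketch, assuming Pearson's correlation coefficient and an ordinary least-squares straight-line fit are what is meant; the names correlation and linear_regression are illustrative only:

```python
import math

def _centered_sums(xs, ys):
    """Return (mean_x, mean_y, Sxx, Syy, Sxy) for paired data."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two data points are required")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return mx, my, sxx, syy, sxy

def correlation(xs, ys):
    """Pearson's correlation coefficient r."""
    _, _, sxx, syy, sxy = _centered_sums(xs, ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)

def linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept)."""
    mx, my, sxx, _, sxy = _centered_sums(xs, ys)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    slope = sxy / sxx
    return slope, my - slope * mx

r = correlation([1, 2, 3, 4], [3, 5, 7, 9])
slope, intercept = linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
```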

Paul.

Paul Moore