On Tue, Sep 13, 2011 at 12:23 PM, Paul Moore <p.f.moore@gmail.com> wrote:
On 13 September 2011 05:06, Nick Coghlan <ncoghlan@gmail.com> wrote:
> On Tue, Sep 13, 2011 at 11:00 AM, Steven D'Aprano <steve@pearwood.info> wrote:
>> I propose adding a basic calculator statistics module to the standard
>> library, similar to the sorts of functions you would get on a scientific
>> calculator:
>>
>> mean (average)
>> variance (population and sample)
>> standard deviation (population and sample)
>> correlation coefficient
>>
>> and similar. I am volunteering to provide, and support, this module, written
>> in pure Python so other implementations will be able to use it.
>>
>> Simple calculator-style statistics seem to me to be a fairly obvious
>> "battery" to be included, more useful in practice than some functions
>> already available such as factorial and the hyperbolic functions.
>
> And since some folks may not have seen it, Steven's proposal here is
> following up on a suggestion Raymond Hettinger posted to this list
> last year:
>
> http://mail.python.org/pipermail/python-ideas/2010-October/008267.html
>
> From my point of view, I'd make the following suggestions:
>
> 1. We should start very small (similar to the way itertools grew over time)
>
> To me that means:
>  mean, median, mode
>  variance
>  standard deviation
>
> Anything beyond that (including coroutine-style running calculations)
> is probably better left until 3.4. In the specific case of running
> calculations, this is to give us a chance to see how coroutine APIs
> are best written in a world where generators can return values as well
> as yield them. Any APIs that would benefit from having access to
> running variants (such as being able to collect multiple statistics in
> a single pass) should also be postponed.
>
> Some more advanced algorithms could be included as recipes in the
> initial docs. The docs should also include pointers to more
> full-featured stats modules for reference when users' needs outgrow the
> included batteries.
>
> 2. The 'math' module is not the place for this; a new, dedicated
> module is more appropriate. This is mainly because the
> math module is focused primarily on binary floating point, while these
> algorithms should be neutral with regard to the specific numeric type
> involved. However, the practical issues with math being a builtin
> module are also a factor.
>
> There are many colours the naming bikeshed could be painted, but I'd
> be inclined to just call it 'statistics' ('statstools' is unwieldy,
> and other variants like 'stats', 'simplestats', 'statlib' and
> 'stats-tools' all exist on PyPI). Since the opportunity to just use
> the full word is there, we may as well take it.

+1 (on both Steven's original suggestion and Nick's follow-up comment).

I like the suggestion of having a running calculation version, but
agree that it's probably a bit soon to decide on the best API for such
things. Recipes in the documentation would be a good start, though.
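
As a concrete example of the kind of recipe the docs could include, here is a minimal sketch of a single-pass running mean and variance based on Welford's algorithm (my sketch only; the names and exact API would be up for discussion):

def running_mean_variance(iterable):
    # single-pass mean and population variance via Welford's algorithm,
    # which avoids the catastrophic cancellation that the naive
    # sum-of-squares formula suffers from
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in iterable:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, (m2 / n if n else 0.0)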

In the past few months I've done some work on "running calculations" in Python and have come up with a module I call RunningCalcs:
http://pypi.python.org/pypi/RunningCalcs/
http://bitbucket.org/taleinat/runningcalcs/
It includes comprehensive tests and some benchmarks (in the wiki at BitBucket).

If "running calculations" are to be considered for inclusion in the stdlib, I propose RunningCalcs as an example implementation. Note that implementing calculations in this manner makes performing several calculations on a single iterable very easy and potentially efficient.

RunningCalcs includes implementations of a few calculations, including mean, variance and standard deviation, min & max, several summation algorithms, and n-largest & n-smallest. Implementing a RunningCalc is simple and straightforward. Usage is as follows:

# 'inputs' below is any iterable of numbers; assuming these names are
# importable directly from the RunningCalcs module
from RunningCalcs import (RunningMean, RunningStdDev, RunningNSmallest,
                          RunningNLargest, apply_in_parallel)

# feeding inputs directly to the RunningCalc instances, one input at a time
mean_rc, stddev_rc = RunningMean(), RunningStdDev()
for x in inputs:
    mean_rc.feed(x)
    stddev_rc.feed(x)
mean, stddev = mean_rc.value, stddev_rc.value

# easy & fast calculation using apply_in_parallel(), which feeds every
# input to all of the given RunningCalcs in a single pass
mean, stddev = apply_in_parallel(inputs, [RunningMean(), RunningStdDev()])
small5, large5 = apply_in_parallel(inputs, [RunningNSmallest(5), RunningNLargest(5)])
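
To illustrate how little code a new calculation needs, here is a sketch of one (an illustration of the feed()/.value interface shown above only; the actual base class and protocol details in RunningCalcs may differ):

class RunningSum(object):
    # minimal running calculation: feed() consumes one input value,
    # .value always holds the current result
    def __init__(self):
        self.value = 0
    def feed(self, x):
        self.value += x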

Regarding coroutines: during development I considered using coroutine-style generators; my implementation of Kahan summation still uses such a generator. I've found that this isn't a good generic method for implementing "running calculations", mainly because such a generator must compute and yield the current value at every step, even though that value is usually needed far less often. For example, implementing a running version of n-largest using a coroutine/generator would introduce a large overhead, whereas my version is as fast as _heapq.nlargest (which is implemented in C -- see the benchmarks for details).
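
To make that overhead concrete, here is the general shape of a coroutine-style running mean (a sketch of the pattern, not code from RunningCalcs); note that it must compute and yield the updated value on every send():

def running_mean():
    # coroutine-style running calculation: each send(x) feeds one input
    # and yields the updated mean, whether or not the caller needs it
    total = 0.0
    count = 0
    value = None
    while True:
        x = yield value
        total += x
        count += 1
        value = total / count

rm = running_mean()
next(rm)  # prime the coroutine up to the first yield
for x in inputs:
    mean = rm.send(x)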

- Tal Einat