<div dir="ltr"><div class="gmail_quote">On Tue, Sep 13, 2011 at 12:23 PM, Paul Moore <span dir="ltr"><<a href="mailto:p.f.moore@gmail.com" target="_blank">p.f.moore@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div><div></div><div>On 13 September 2011 05:06, Nick Coghlan <<a href="mailto:ncoghlan@gmail.com" target="_blank">ncoghlan@gmail.com</a>> wrote:<br>

> On Tue, Sep 13, 2011 at 11:00 AM, Steven D'Aprano <<a href="mailto:steve@pearwood.info" target="_blank">steve@pearwood.info</a>> wrote:<br>

>> I propose adding a basic calculator statistics module to the standard<br>

>> library, similar to the sorts of functions you would get on a scientific<br>

>> calculator:<br>

>><br>

>> mean (average)<br>

>> variance (population and sample)<br>

>> standard deviation (population and sample)<br>

>> correlation coefficient<br>

>><br>

>> and similar. I am volunteering to provide, and support, this module, written<br>

>> in pure Python so other implementations will be able to use it.<br>

>><br>

>> Simple calculator-style statistics seem to me to be a fairly obvious<br>

>> "battery" to be included, more useful in practice than some functions<br>

>> already available such as factorial and the hyperbolic functions.<br>

><br>

> And since some folks may not have seen it, Steven's proposal here is<br>

> following up on a suggestion Raymond Hettinger posted to this list last<br>

> year:<br>

><br>

> <a href="http://mail.python.org/pipermail/python-ideas/2010-October/008267.html" target="_blank">http://mail.python.org/pipermail/python-ideas/2010-October/008267.html</a><br>

><br>

> From my point of view, I'd make the following suggestions:<br>

><br>

> 1. We should start very small (similar to the way itertools grew over time)<br>

><br>

> To me that means:<br>

>  mean, median, mode<br>

>  variance<br>

>  standard deviation<br>

><br>

> Anything beyond that (including coroutine-style running calculations)<br>

> is probably better left until 3.4. In the specific case of running<br>

> calculations, this is to give us a chance to see how coroutine APIs<br>

> are best written in a world where generators can return values as well<br>

> as yielding them. Any APIs that would benefit from having access to<br>

> running variants (such as being able to collect multiple statistics in<br>

> a single pass) should also be postponed.<br>

><br>

> Some more advanced algorithms could be included as recipes in the<br>

> initial docs. The docs should also include pointers to more<br>

> full-featured stats modules for reference when users' needs outgrow the<br>

> included batteries.<br>

><br>

> 2. The 'math' module is not the place for this, a new, dedicated<br>

> module is more appropriate. This is mainly due to the fact that the<br>

> math module is focused primarily on binary floating point, while these<br>

> algorithms should be neutral with regard to the specific numeric type<br>

> involved. However, the practical issues with math being a builtin<br>

> module are also a factor.<br>

><br>

> There are many colours the naming bikeshed could be painted, but I'd<br>

> be inclined to just call it 'statistics' ('statstools' is unwieldy,<br>

> and other variants like 'stats', 'simplestats', 'statlib' and<br>

> 'stats-tools' all exist on PyPI). Since the opportunity to just use<br>

> the full word is there, we may as well take it.<br>

<br>

</div></div>+1 (both on Steven's original suggestion and on Nick's follow-up comment).<br>

<br>

I like the suggestion of having a running calculation version, but<br>

agree that it's probably a bit soon to decide on the best API for such<br>

things. Recipes in the documentation would be a good start, though.<br></blockquote><div><br>In the past few months I've done some work on "running calculations" in Python, and came up with a module I call RunningCalcs:<br>


<a href="http://pypi.python.org/pypi/RunningCalcs/" target="_blank">http://pypi.python.org/pypi/RunningCalcs/</a><br><a href="http://bitbucket.org/taleinat/runningcalcs/" target="_blank">http://bitbucket.org/taleinat/runningcalcs/</a><br>


It includes comprehensive tests and some benchmarks (in the wiki at BitBucket).<br><br>If "running calculations" are to be considered for inclusion in the stdlib, I propose RunningCalcs as an example implementation. Note that implementing calculations in this manner makes it easy, and potentially efficient, to perform several calculations in a single pass over one iterable.<br>


<br>RunningCalcs includes implementations of a few calculations, including mean, variance and standard deviation, min & max, several summation algorithms and n-largest & n-smallest. Implementing a RunningCalc is simple and straightforward. Usage is as follows:<br>


<br># feeding inputs directly to the RunningCalc instances, one input at a time<br>mean_rc, stddev_rc = RunningMean(), RunningStdDev()<br>for x in inputs:<br>    mean_rc.feed(x)<br>    stddev_rc.feed(x)<br>mean, stddev = mean_rc.value, stddev_rc.value<br>


<br># easy & fast calculation using apply_in_parallel()<br>a_i_p = apply_in_parallel<br>mean, stddev = a_i_p(inputs, [RunningMean(), RunningStdDev()])<br>small5, large5 = a_i_p(inputs, [RunningNSmallest(5), RunningNLargest(5)])<br>
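<br>For readers who want to see the shape of the interface used above, here is a minimal sketch of objects with a feed() method and a value attribute, plus a single-pass helper. These classes are hypothetical stand-ins written for illustration, not the RunningCalcs source: the use of Welford's method for the standard deviation, the choice of the sample (n - 1) form, and the list return type of apply_in_parallel are all assumptions.<br>

```python
import math


class RunningMean:
    """Illustrative stand-in: running arithmetic mean (not RunningCalcs code)."""

    def __init__(self):
        self.count = 0
        self.value = 0.0

    def feed(self, x):
        self.count += 1
        # Move the current mean towards x by 1/count of the difference.
        self.value += (x - self.value) / self.count


class RunningStdDev:
    """Illustrative stand-in: sample standard deviation via Welford's method."""

    def __init__(self):
        self.count = 0
        self._mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def feed(self, x):
        self.count += 1
        delta = x - self._mean
        self._mean += delta / self.count
        # Uses the deviation from both the old and the updated mean.
        self._m2 += delta * (x - self._mean)

    @property
    def value(self):
        # Assumption: report 0.0 until at least two inputs have been fed.
        if self.count > 1:
            return math.sqrt(self._m2 / (self.count - 1))
        return 0.0


def apply_in_parallel(inputs, calcs):
    """Feed every input to every calculation in one pass over `inputs`."""
    for x in inputs:
        for calc in calcs:
            calc.feed(x)
    return [calc.value for calc in calcs]
```

<br>Because the inputs are consumed exactly once, a sketch like this also works when inputs is a generator or another one-shot iterator.<br>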


<br>Regarding co-routines: During development I considered using co-routine-generators; my implementation of Kahan summation still uses such a generator. I've found this isn't a good generic method for implementing "running calculations", mainly because such a generator must return the current value at each iteration, even though this value is usually not needed nearly so often. For example, implementing a running version of n-largest using a co-routine/generator would introduce a large overhead, whereas my version is as fast as _heapq.nlargest (which is implemented in C -- see benchmarks for details).<br>
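<br>To make the overhead point concrete, here is a minimal sketch of a generator-based running sum using Kahan compensation. It is an assumed illustration of the co-routine style, not the generator actually used in RunningCalcs, and the names kahan_sum_coroutine and running_kahan_sum are hypothetical.<br>

```python
def kahan_sum_coroutine():
    """Generator-based running sum with Kahan compensation (illustrative)."""
    total = 0.0
    compensation = 0.0  # low-order bits lost by earlier additions
    x = yield total  # the priming next() call receives the initial total
    while True:
        y = x - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
        # Every send() must hand back the current total, needed or not.
        x = yield total


def running_kahan_sum(inputs):
    """Drive the generator over `inputs` and return the final total."""
    summer = kahan_sum_coroutine()
    current = next(summer)  # prime the generator
    for x in inputs:
        # A value is produced on every send, but only the last one is kept.
        current = summer.send(x)
    return current
```

<br>In this style the current result is produced on every send(), which is cheap for a sum but would mean rebuilding an ordered result at each step for something like n-largest. A feed()/value design computes the result only when value is read.<br>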


<br>- Tal Einat<br></div></div></div>