[Python-Dev] Accumulation module

Tim Peters tim.one at comcast.net
Tue Jan 13 15:59:18 EST 2004


[Raymond Hettinger]
> I'm working on a module with some accumulation/reduction/statistical
> formulas:

There are several statistical packages available for Python right now, so I
hope you're scouring them for ideas and algorithms.

> average(iterable)
> stddev(iterable, sample=False)

sample should probably default to True (since that's what most people really
want, even if they don't realize it).
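
For concreteness, here's the difference the flag makes (a worked example
I'm adding, assuming the stddev signature proposed above):

    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean is 5.0
    # sum of squared deviations from the mean is 32.0, and n is 8
    # population:  sqrt(32.0 / 8) == 2.0       <- sample=False
    # sample:      sqrt(32.0 / 7) == 2.138...  <- sample=True

Bessel's correction (dividing by n-1 instead of n) compensates for
estimating the mean from the same data, which is why the sample flavor is
usually what people want.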

> product(iterable)
> nlargest(iterable, n=1)
> nsmallest(iterable, n=1)
>
> The questions that have arisen so far are:
>
> * What to call the module
>
> * What else should be in it?
>
> * Is "sample" a good keyword to distinguish from population stddev?

I think so, but I'm sure you're looking at what other packages do <wink>.

> * There seems to be a speed/space choice on how to implement
> nlargest/nsmallest.  The faster way lists out the entire iterable,
> heapifies it, and pops off the top n elements.  The slower way is less
> memory intensive:  build only an n-length heap and then just do a
> heapreplace when necessary.  Note, heapq is used for both (I use
> operator.neg to swap between largest and smallest).

So long as n is significantly smaller than the # of elements in the
iterable, I expect the latter way is also faster.  For example, consider
n=1, and let N be the # of incoming elements.  Then N-1 comparisons are
necessary and sufficient (to find the min, or the max).  Using a heap of
size 1 achieves that.  Heapifying a list of N elements requires more
comparisons than that ("it varies", and while heapifying is O(N), it's not
exactly N).  In effect, all the work a heap of size N does to distinguish
among elements larger/smaller than the n smallest/largest so far is wasted,
as those can have no effect on the final result.
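
To make the latter concrete, here's a minimal sketch of the bounded-heap
variant (my own illustration, assuming the proposed nlargest(iterable, n=1)
signature):

    import heapq
    import itertools

    def nlargest(iterable, n=1):
        if n <= 0:
            return []
        it = iter(iterable)
        # Seed a min-heap with the first n elements; heap[0] is then
        # the weakest member of the current top n.
        heap = list(itertools.islice(it, n))
        heapq.heapify(heap)
        for item in it:
            # One comparison per remaining element; the O(log n)
            # heapreplace runs only when the item beats the weakest.
            if item > heap[0]:
                heapq.heapreplace(heap, item)
        heap.sort(reverse=True)
        return heap

With n == 1 this is exactly the N-1 comparisons described above, and
nsmallest falls out of the same shape by flipping the comparison (or
negating the elements, per the operator.neg trick in the quote).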

> * Is there a way to compute the standard deviation without multiple
> passes over the data (one to compute the mean and a second to sum the
> squares of the differences from the mean)?

The textbook formula does it in one pass, by accumulating running totals of
the elements and of their squares:  variance = (sum(x*x) - sum(x)**2/n) / n.
It's numerically atrocious, though, and should never be used:  when the mean
is large relative to the spread, the subtraction cancels catastrophically,
and the result can even come out negative!  Knuth gives a much better
recurrence formula for this, doing one-pass computation of variance in a
numerically stable way; a sketch follows.
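
Here's my reconstruction of that recurrence (Welford's update, as given in
Knuth's TAOCP vol. 2), folded into the proposed stddev signature with
sample defaulting to True as suggested above:

    def stddev(iterable, sample=True):
        # One-pass recurrence: update a running mean and the running
        # sum of squared deviations without storing the data.
        n = 0
        mean = 0.0
        m2 = 0.0
        for x in iterable:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)   # uses the *updated* mean
        if n < 2:
            raise ValueError("stddev requires at least two data points")
        # Bessel's correction (n - 1) for a sample; plain n for a
        # population.
        return (m2 / (n - 1 if sample else n)) ** 0.5

Unlike the running-totals formula, m2 accumulates only nonnegative terms
(delta and x - mean always share a sign), so it can never go negative.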



