
[Raymond Hettinger]
I'm working on a module with some accumulation/reduction/statistical formulas:
There are several statistical pkgs available for Python right now, so I hope you're scouring them for ideas and algorithms.
average(iterable)
stddev(iterable, sample=False)
sample should probably default to True (since that's what most people really want, even if they don't realize it).
product(iterable)
nlargest(iterable, n=1)
nsmallest(iterable, n=1)
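For the simpler accumulators, a minimal sketch of what the proposed signatures might mean (the bodies here are my own illustration, not the module's actual code):

```python
import operator
from functools import reduce

def average(iterable):
    # Arithmetic mean, computed with a single pass so any iterable works.
    total, n = 0.0, 0
    for x in iterable:
        total += x
        n += 1
    return total / n

def product(iterable):
    # Multiply all elements together; 1 is the identity for an empty input.
    return reduce(operator.mul, iterable, 1)
```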
The questions that have arisen so far are:
* What to call the module
* What else should be in it?
* Is "sample" a good keyword to distinguish from population stddev?
I think so, but I'm sure you're looking at what other pkgs do <wink>.
* There seems to be a speed/space choice on how to implement nlargest/nsmallest. The faster way lists out the entire iterable, heapifies it, and pops off the top n elements. The slower way is less memory intensive: build only an n-length heap and then just do a heapreplace when necessary. Note, heapq is used for both (I use operator.neg to swap between largest and smallest).
So long as n is significantly smaller than the # of elements in the iterable, I expect the latter way is also faster. For example, consider n=1, and let N be the # of incoming elements. Then N-1 comparisons are necessary and sufficient (to find the min, or the max). Using a heap of size 1 achieves that. Heapifying a list of N elements requires more comparisons than that ("it varies", and while heapifying is O(N), it's not exactly N). In effect, all the work a heap of size N does to distinguish among elements larger/smaller than the n smallest/largest so far is wasted, as those can have no effect on the final result.
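The bounded-heap approach described above can be sketched like this (my own sketch, assuming numeric elements so the operator.neg trick works; not the module's actual code):

```python
import heapq

def nlargest(iterable, n=1):
    # Keep a min-heap of the n largest elements seen so far:
    # O(n) space, and each incoming element costs at most O(log n).
    heap = []
    for x in iterable:
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the smallest of the current top n; swap it in.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

def nsmallest(iterable, n=1):
    # Same machinery with negated values, per the operator.neg trick.
    return [-x for x in nlargest((-x for x in iterable), n)]
```

With n=1 this does exactly the N-1 comparisons needed to find the max, matching the argument above.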
* Is there a way to compute the standard deviation without multiple passes over the data (one to compute the mean and a second to sum the squares of the differences from the mean)?
The textbook formula does it in one pass, by accumulating running totals of the elements and of their squares. It's numerically atrocious, though, and should never be used (it can even end up delivering a negative result!). Knuth gives a much better recurrence formula for this, doing one-pass computation of variance in a numerically stable way.
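Knuth's recurrence (Welford's method, TAOCP vol. 2) might look like this in Python; this is my own rendering, with the sample/population switch from the proposed signature:

```python
import math

def stddev(iterable, sample=False):
    # One-pass, numerically stable variance via the Welford/Knuth recurrence.
    # Unlike the sum-of-squares textbook formula, m2 accumulates squared
    # deviations from the *running* mean, so it can never go negative.
    n, mean, m2 = 0, 0.0, 0.0
    for x in iterable:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    divisor = n - 1 if sample else n
    return math.sqrt(m2 / divisor)
```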