
I'm working a module with some accumulation/reduction/statistical formulas: average(iterable): stddev(iterable, sample=False) product(iterable) nlargest(iterable, n=1) nsmallest(iterable, n=1) The questions that have arisen so far are: * What to call the module * What else should be in it? * Is "sample" a good keyword to distinguish from population stddev? * There seems to be a speed/space choice on how to implement nlargest/nsmallest. The faster way lists out the entire iterable, heapifies it, and pops off the top n elements. The slower way is less memory intensive: build only a n-length heap and then just do a heapreplace when necessary. Note, heapq is used for both (I use operator.neg to swap between largest and smallest). * Is there a way to compute the standard deviation without multiple passes over the data (one to compute the mean and a second to sum the squares of the differences from the mean? Raymond Hettinger

"Raymond Hettinger" <raymond.hettinger@verizon.net> writes:
Does that mean nlargest/nsmallest only work for numbers? I think it might be useful for e.g. strings too. Bernhard -- Intevation GmbH http://intevation.de/ Sketch http://sketch.sourceforge.net/ Thuban http://thuban.intevation.org/

On Tue, Jan 13, 2004, Raymond Hettinger wrote:
stats -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ A: No. Q: Is top-posting okay?

* What to call the module
[Aahz]
stats
There is already a stat module. Any chance of confusion? The other naming issue is that some of the functions have non-statistical uses: product() is general purpose; nlargest() and nsmallest() will accept any datatype (though most of the use cases are with numbers). Are there other general purpose (non-statistical) accumulation/reduction formulas that go here?
* What else should be in it?
[Matthias Klose]
you may want to have a look at http://www.nmr.mgh.harvard.edu/Neural_Systems_Group/gary/python.html
Ages ago, when the idea for this module first arose, a certain bot recommended strongly against including any but the most basic statistical functions (muttering something about the near impossibility of doing it well in either python or portable C and something about not wanting to maintain anything that wasn't dirt simple). His words would have of course fallen on deaf ears, but a certain dictatorial type had just finished teaching advanced programming skills to people who couldn't operate a high school calculator. Sooooo, no Kurtosis for you, no gamma function for me! It's possible that chi-square or regression could slip in, but it would require considerable cheerleading and a rare planetary alignment.
* What else should be in it?
[Jeremy]
median()
And a function like bins() or histogram() that accumulates the values in buckets of some size.
That sounds beginner simple and reasonably useful though it would have been nice if all the reduction formulas could work with one-pass and never need to manifest the whole dataset in memory.
Note, heapq is used for both (I use operator.neg to swap between largest and smallest).
[Bernhard Herzog]
Does that mean nlargest/nsmallest only work for numbers? I think it might be useful for e.g. strings too.
The plan was to make them work with anything defining __lt__; however, if it is coded in python and uses heapq, I don't see a straight-forward way around using operator.neg without wrapping everything in some sense reverser object. Raymond Hettinger

On Wed, Jan 14, 2004, Raymond Hettinger wrote:
Make it statistics, then.
Doesn't matter, IMO. The primary use-case for this module will be statistics; people looking for the more unusual functionality will still be likely to poke in this module for related uses. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ A: No. Q: Is top-posting okay?

>>>> * What to call the module >> >> [Aahz] >>> stats >> >> There is already a stat module. Any chance of confusion? aahz> Make it statistics, then. Statistics is a rather big field. Maybe this should be a package, regardless what the final name is so that it would more easily allow future extensions. Skip

Aahz wrote:
There is considerable prior art here worth considering: see http://www.nmr.mgh.harvard.edu/Neural_Systems_Group/gary/python.html

On Wed, Jan 14, 2004 at 09:47:15PM -0800, David Ascher wrote:
*jumps into a dead thread* I wrote http://probstat.sourceforge.net/ which just does fast combination/permutation/cartesian products in C w/ a python binding. I'd be happy to throw a PSF license on it if you care for some or all of it. -jackdied

On Tue, 13 Jan 2004, Raymond Hettinger wrote:
Yes -- def one_pass_stddev(l): n = 0 x = 0. xx = 0. for y in l: n += 1 x += y xx += y*y x /= n xx /= n var = max(0,xx - x*x) dev = var**0.5 return dev Skewness and kurtosis can also be computed in one pass, though numerical stability problems can occur (even with std.dev) with these kinds of methods. -Kevin

Raymond Hettinger writes:
you may want to have a look at http://www.nmr.mgh.harvard.edu/Neural_Systems_Group/gary/python.html

[Raymond Hettinger]
I'm working a module with some accumulation/reduction/statistical formulas:
There are several statistical pkgs available for Python right now, so I hope you're scouring them for ideas and algorithms.
average(iterable): stddev(iterable, sample=False)
sample should probably default to True (since that's what most people really want, even if they don't realize it).
I think so, but I'm sure you're looking at what other pkgs do <wink>.
So long as n is significantly smaller than the # of elements in the iterable, I expect the latter way is also faster. For example, consider n=1, and let N be the # of incoming elements. Then N-1 comparisons are necessary and sufficient (to find the min, or the max). Using a heap of size 1 achieves that. Heapifying a list of N elements requires more comparisons than that ("it varies", and while heapifying is O(N), it's not exactly N). In effect, all the work a heap of size N does to distinguish among elements larger/smaller than the n smallest/largest so far is wasted, as those can have no effect on the final result.
The textbook formula does it in one pass, by accumulating running totals of the elements and of their squares. It's numerically atrocious, though, and should never be used (it can even end up delivering a negative result!). Knuth gives a much better recurrence formula for this, doing one-pass computation of variance in a numerically stable way.

"Raymond Hettinger" <raymond.hettinger@verizon.net> wrote in message news:00cf01c3da0e$b3a2e3e0$6e02a044@oemcomputer...
"Average" is a notoriously vague word. Prefer "mean" -- which opens the door for "median" and "mode" to join the party. I could have used a stats module a few months ago -- feels like a bit of an omission from the standard library, especially compared against the richness of the random module. James

James Kew wrote:
If this basic statistics module makes it into the standard library, perhaps the associated module docs could point to some of the more 'full-featured' statistical packages that are out there? Regards, Nick. -- Nick Coghlan | Brisbane, Australia Email: ncoghlan@email.com | Mobile: +61 409 573 268

"Raymond Hettinger" <raymond.hettinger@verizon.net> writes:
Does that mean nlargest/nsmallest only work for numbers? I think it might be useful for e.g. strings too. Bernhard -- Intevation GmbH http://intevation.de/ Sketch http://sketch.sourceforge.net/ Thuban http://thuban.intevation.org/

On Tue, Jan 13, 2004, Raymond Hettinger wrote:
stats -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ A: No. Q: Is top-posting okay?

* What to call the module
[Aahz]
stats
There is already a stat module. Any chance of confusion? The other naming issue is that some of the functions have non-statistical uses: product() is general purpose; nlargest() and nsmallest() will accept any datatype (though most of the use cases are with numbers). Are there other general purpose (non-statistical) accumulation/reduction formulas that go here?
* What else should be in it?
[Matthias Klose]
you may want to have a look at http://www.nmr.mgh.harvard.edu/Neural_Systems_Group/gary/python.html
Ages ago, when the idea for this module first arose, a certain bot recommended strongly against including any but the most basic statistical functions (muttering something about the near impossibility of doing it well in either python or portable C and something about not wanting to maintain anything that wasn't dirt simple). His words would have of course fallen on deaf ears, but a certain dictatorial type had just finished teaching advanced programming skills to people who couldn't operate a high school calculator. Sooooo, no Kurtosis for you, no gamma function for me! It's possible that chi-square or regression could slip in, but it would require considerable cheerleading and a rare planetary alignment.
* What else should be in it?
[Jeremy]
median()
And a function like bins() or histogram() that accumulates the values in buckets of some size.
That sounds beginner simple and reasonably useful though it would have been nice if all the reduction formulas could work with one-pass and never need to manifest the whole dataset in memory.
Note, heapq is used for both (I use operator.neg to swap between largest and smallest).
[Bernhard Herzog]
Does that mean nlargest/nsmallest only work for numbers? I think it might be useful for e.g. strings too.
The plan was to make them work with anything defining __lt__; however, if it is coded in python and uses heapq, I don't see a straight-forward way around using operator.neg without wrapping everything in some sense reverser object. Raymond Hettinger

On Wed, Jan 14, 2004, Raymond Hettinger wrote:
Make it statistics, then.
Doesn't matter, IMO. The primary use-case for this module will be statistics; people looking for the more unusual functionality will still be likely to poke in this module for related uses. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ A: No. Q: Is top-posting okay?

>>>> * What to call the module >> >> [Aahz] >>> stats >> >> There is already a stat module. Any chance of confusion? aahz> Make it statistics, then. Statistics is a rather big field. Maybe this should be a package, regardless what the final name is so that it would more easily allow future extensions. Skip

Aahz wrote:
There is considerable prior art here worth considering: see http://www.nmr.mgh.harvard.edu/Neural_Systems_Group/gary/python.html

On Wed, Jan 14, 2004 at 09:47:15PM -0800, David Ascher wrote:
*jumps into a dead thread* I wrote http://probstat.sourceforge.net/ which just does fast combination/permutation/cartesian products in C w/ a python binding. I'd be happy to throw a PSF license on it if you care for some or all of it. -jackdied

On Tue, 13 Jan 2004, Raymond Hettinger wrote:
Yes -- def one_pass_stddev(l): n = 0 x = 0. xx = 0. for y in l: n += 1 x += y xx += y*y x /= n xx /= n var = max(0,xx - x*x) dev = var**0.5 return dev Skewness and kurtosis can also be computed in one pass, though numerical stability problems can occur (even with std.dev) with these kinds of methods. -Kevin

Raymond Hettinger writes:
you may want to have a look at http://www.nmr.mgh.harvard.edu/Neural_Systems_Group/gary/python.html

[Raymond Hettinger]
I'm working a module with some accumulation/reduction/statistical formulas:
There are several statistical pkgs available for Python right now, so I hope you're scouring them for ideas and algorithms.
average(iterable): stddev(iterable, sample=False)
sample should probably default to True (since that's what most people really want, even if they don't realize it).
I think so, but I'm sure you're looking at what other pkgs do <wink>.
So long as n is significantly smaller than the # of elements in the iterable, I expect the latter way is also faster. For example, consider n=1, and let N be the # of incoming elements. Then N-1 comparisons are necessary and sufficient (to find the min, or the max). Using a heap of size 1 achieves that. Heapifying a list of N elements requires more comparisons than that ("it varies", and while heapifying is O(N), it's not exactly N). In effect, all the work a heap of size N does to distinguish among elements larger/smaller than the n smallest/largest so far is wasted, as those can have no effect on the final result.
The textbook formula does it in one pass, by accumulating running totals of the elements and of their squares. It's numerically atrocious, though, and should never be used (it can even end up delivering a negative result!). Knuth gives a much better recurrence formula for this, doing one-pass computation of variance in a numerically stable way.

"Raymond Hettinger" <raymond.hettinger@verizon.net> wrote in message news:00cf01c3da0e$b3a2e3e0$6e02a044@oemcomputer...
"Average" is a notoriously vague word. Prefer "mean" -- which opens the door for "median" and "mode" to join the party. I could have used a stats module a few months ago -- feels like a bit of an omission from the standard library, especially compared against the richness of the random module. James

James Kew wrote:
If this basic statistics module makes it into the standard library, perhaps the associated module docs could point to some of the more 'full-featured' statistical packages that are out there? Regards, Nick. -- Nick Coghlan | Brisbane, Australia Email: ncoghlan@email.com | Mobile: +61 409 573 268
participants (14)
-
Aahz
-
Bernhard Herzog
-
David Ascher
-
Jack Diederich
-
James Kew
-
Jeremy Hylton
-
Kevin Jacobs
-
Matthias Klose
-
Nick Coghlan
-
Raymond Hettinger
-
Raymond Hettinger
-
Skip Montanaro
-
Thomas Heller
-
Tim Peters