[Python-ideas] Pre-PEP: adding a statistics module to Python

Steven D'Aprano steve at pearwood.info
Thu Aug 8 02:19:50 CEST 2013


On 08/08/13 08:24, Ethan Furman wrote:
> On 08/07/2013 03:20 PM, David Mertz wrote:
>>
>> Here's a question for the actual statisticians on the list (I'm not close to this).  Would having a look-ahead window of
>> moderate size (probably configurable) do enough good in numeric accuracy to be worthwhile?  Obviously, creating
>> pathological cases is still possible, but in the "normal" situation, does this matter enough?  I.e. if the function were
>> to read 100 numbers from an iterator, perform some manipulation on their ordering or scaling, produce that better
>> intermediate result, then do the same with the next chunk of 100 numbers, is this enough of a win to have as an option?
>
> I have a follow-up question:  considering the built-in error when calculating statistics, is the difference between sequence and iterator significant?  Would we be just as well served with `it = iter(sequence)` and always using the one-pass algorithm?

To the best of my knowledge, there is no widely-known algorithm for accurately calculating variance in chunks. There's a two-pass algorithm and a one-pass algorithm, and those are what people use. My guess is that you could adapt the one-pass algorithm to look ahead in chunks, but to my mind that adds complexity for no benefit.
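For concreteness, here is a sketch of the two algorithms being contrasted, in their standard textbook forms (the function names are mine, not from any proposed API): the two-pass algorithm needs the whole sequence up front, while the one-pass (Welford-style) algorithm can consume an iterator.

```python
def variance_two_pass(data):
    """Sample variance via the two-pass algorithm: one pass to
    compute the mean, a second pass for the squared deviations.
    Requires a sequence, since the data is traversed twice."""
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)
    return ss / (n - 1)

def variance_one_pass(iterable):
    """Sample variance via Welford's one-pass algorithm: updates a
    running mean and running sum of squared deviations, so it works
    on an iterator without holding the data in memory."""
    n = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the current mean
    for x in iterable:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)
```

Both give the same answer up to rounding; the one-pass version is what you would reach for if the API accepted arbitrary iterables.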

Computationally, there is a small but real difference between the two algorithms, as can be seen from the fact that they give different results. Does the difference make a difference *in practice*, given that most real-world measurements are only accurate to 2-4 decimal places? No, not really. If your data is accurate to 3 decimal places, you've got no business quoting the variance to 15 decimal places.
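A quick self-contained illustration of that point, using synthetic data rounded to 3 decimal places (the helper functions and data here are my own, just textbook implementations for comparison): the discrepancy between the two algorithms is many orders of magnitude below the precision of the data itself.

```python
import random

# Synthetic "measurements" accurate to 3 decimal places.
random.seed(42)
data = [round(100 + random.gauss(0, 1), 3) for _ in range(10_000)]

def var_two_pass(xs):
    # Two-pass: exact mean first, then squared deviations.
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def var_welford(xs):
    # One-pass (Welford): running mean and sum of squared deviations.
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)

a, b = var_two_pass(data), var_welford(data)
# The two results differ only in the last few bits of the float,
# far below the 1e-3 precision of the input data.
print(a, b, abs(a - b))
```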

But people will compare the std lib variance to numpy's variance, to Excel's variance, to their calculator's variance, and when there is a discrepancy, I would rather it be in our favour than have to make the excuse "well, you see, we picked the slightly less accurate algorithm in order to prematurely optimize for enormous data sets too big to fit into memory".

The two-pass algorithm stays. I'm deferring to 3.5 a set of one-pass, iterator-friendly functions that will be suitable for calculating multiple statistics from a single data stream without building a list.



-- 
Steven
