[Python-ideas] Pre-PEP: adding a statistics module to Python

Wed Aug 7 03:34:29 CEST 2013

On 07/08/13 01:49, Oscar Benjamin wrote:

> Taking the example from the PEP:
>
> >>> from statistics import *
> >>> data = [1, 2, 4, 5, 8]
> >>> data = [x+1e12 for x in data]
> >>> variance(data)
> 7.5
>
> However:
>
> >>> variance(iter(data))
> 7.4999542236328125
>
> Okay so that's a small difference and it's unlikely to upset many
> people. But being something of a numerical obsessive I do often get
> upset about things like this. It's not that I mind the size of the
> error but rather that I dislike having the calculation implicitly
> changed. I want to think that it doesn't matter whether I pass an
> iterator or a list because either I get an error or I get the same
> result.
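
A discrepancy of this kind can arise when a list is processed with a two-pass algorithm while an iterator is processed in a single pass. The sketch below is only an illustration under that assumption; the names two_pass_variance and one_pass_variance are hypothetical and are not the module's actual functions, and the one-pass version uses Welford's update, which may or may not match the draft implementation.

```python
def two_pass_variance(data):
    """Sample variance: materialise the data, compute the mean, then the squared deviations."""
    values = list(data)  # works for lists and iterators alike
    n = len(values)
    if n < 2:
        raise ValueError("variance requires at least two data points")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def one_pass_variance(data):
    """Sample variance in a single pass (Welford's update); never stores the data."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("variance requires at least two data points")
    return m2 / (n - 1)


data = [x + 1e12 for x in [1, 2, 4, 5, 8]]
exact = two_pass_variance(iter(data))    # 7.5 for this data
approx = one_pass_variance(iter(data))   # close to 7.5, may differ in the low digits
```

With values this large, the running mean in the one-pass version cannot always be represented exactly, so its result can drift slightly from the two-pass answer even though both are reasonable algorithms.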

That's fantastic feedback and exactly the sort of thing I want to hear :-)

This is mentioned under "Design Decisions" in the PEP, where it is treated as a feature, but I'm open to revising that behaviour. The 3.4 feature freeze is quite close, and I don't want to hold up acceptance of the PEP (which doesn't even have a number yet!) over one-pass stats calculations. So I'm going to take this approach:

- The difference between variance(list(data)) and variance(iter(data)) is an artifact of implementation, not a feature, so it is subject to change.

- I doubt I will reject iterators, but I may internally convert them to lists (median already does this).

- For the time being, all documentation examples will only show lists being used.

- I will defer to 3.5 a set of one-pass functions that return running statistics (I already have code for coroutines to do this, but they're not ready for the standard library).
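
To give a rough idea of what a running statistic built on a coroutine might look like, here is a minimal sketch of a generator-based running mean. It is an illustration only: the name running_mean is hypothetical, and this is not the code referred to in the last point.

```python
def running_mean():
    """Generator-based coroutine: send() a value, receive the mean of all values so far."""
    n = 0
    mean = 0.0
    x = yield None  # the first next() call primes the coroutine
    while True:
        n += 1
        mean += (x - mean) / n  # incremental update, no data stored
        x = yield mean


rm = running_mean()
next(rm)  # prime: advance to the first yield
means = [rm.send(value) for value in [1, 2, 3]]
```

Each send() returns the mean of everything seen so far, so the caller can consume data from any iterator without holding it in memory.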

-- 
Steven