[Python-ideas] Pre-PEP: adding a statistics module to Python
Oscar Benjamin
oscar.j.benjamin at gmail.com
Tue Aug 6 17:49:17 CEST 2013
On 2 August 2013 18:45, Steven D'Aprano <steve at pearwood.info> wrote:
> I have raised an issue on the tracker to add a statistics module to Python's
> standard library:
>
> http://bugs.python.org/issue18606
>
> and have been asked to write a PEP. Attached is my draft PEP. Feedback is
> requested, thanks in advance.
I have another query/suggestion for the statistics module.
Taking the example from the PEP:
>>> from statistics import *
>>> data = [1, 2, 4, 5, 8]
>>> data = [x+1e12 for x in data]
>>> variance(data)
7.5
However:
>>> variance(iter(data))
7.4999542236328125
Okay, so that's a small difference and it's unlikely to upset many
people. But being something of a numerical obsessive, I do often get
upset about things like this. It's not that I mind the size of the
error; rather, I dislike having the calculation implicitly changed. I
want to be able to assume that it doesn't matter whether I pass an
iterator or a list: either I get an error or I get the same result.
Now I understand that the reason is a switch from a 2-pass algorithm
to a 1-pass algorithm, and that you want to support working directly
with iterators rather than just collections. However, toy examples
aside, I'm not sure that there is much of a practical use-case for
computing *individual* statistics in a single pass. Whenever I've
wanted to compute statistics in a single pass, I've wanted to compute
*multiple* statistics in *the same* single pass.
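To make the difference concrete, here is a rough sketch (not the code
from the PEP's reference implementation) contrasting a textbook
two-pass sample variance with a Welford-style one-pass update. On the
shifted data above the two-pass form recovers 7.5 exactly, while a
one-pass accumulator can drift slightly, which is the kind of
discrepancy shown in the iterator case:

def variance_two_pass(data):
    # Needs a real collection: the data is traversed twice.
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def variance_one_pass(iterable):
    # Welford-style running update: works on a one-shot iterator.
    n, mean, m2 = 0, 0.0, 0.0
    for x in iterable:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)

data = [x + 1e12 for x in [1, 2, 4, 5, 8]]
print(variance_two_pass(data))        # 7.5
print(variance_one_pass(iter(data)))  # may differ slightly from 7.5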
Really I think that the use-cases are basically like this:
1) You can just put the data in a collection in memory (the common case).
2) Your data is too large to fit in memory, but you can iterate over
it from disk, the network, a computational generator, or the like.
Since the iteration is expensive or unrepeatable, you want to compute
everything in one pass. Such cases happen sometimes, but they are
certainly a lot less common than case 1).
3) Your data/computation is distributed and you want to compute
statistics in a distributed/parallel framework and merge them later (a
very specialised setup that possibly warrants having its own
implementation of the statistical routines anyway).
Currently, the API of the statistics module is only really suited to
case 1). I think it would be better to limit it to that case in order
to simplify the implementation and make the output always consistent.
In other words, I think it should just require a collection, reject
iterators, and use as many passes as it needs to get the best results.
This would make the implementation simpler in a number of areas.
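Rejecting iterators up front would be cheap; something like the
following check would do (the helper name is just illustrative, not
part of any proposed API):

from collections.abc import Iterator

def _require_collection(data):
    # Refuse one-shot iterators so the implementation is free to make
    # as many passes over the data as it likes.
    if isinstance(data, Iterator):
        raise TypeError("data must be a collection, not an iterator")
    return data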
An alternative API would be better suited to single-pass statistics
(and could perhaps be deferred for now). When I've written APIs for
this myself in the past, they have looked more like this:
>>> stats = iterstats('mean', 'min', 'var', 'count')
>>> stats.consume_data([1, 2, 3, 4])
>>> stats.compute_statistics()
{'mean': 2.5, 'min': 1, 'var': 1.666, 'count': 4}
>>> stats.consume_data([5, 6, 7, 8])
...
Satisfying use-case 3) is more complicated, but it basically amounts
to being able to do something like:
>>> allstats = iterstats.merge([stats1, stats2, ...])
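Just to illustrate the shape of it, here is a minimal sketch of such
an accumulator (hypothetical class and method names, tracking only
count/mean/variance), where merging uses the standard parallel
variance combination:

class IterStats:
    # Hypothetical single-pass accumulator: count, mean, sample variance.
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0   # sum of squared deviations from the running mean

    def consume_data(self, iterable):
        for x in iterable:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self._m2 += delta * (x - self.mean)

    def compute_statistics(self):
        var = self._m2 / (self.count - 1) if self.count > 1 else float('nan')
        return {'count': self.count, 'mean': self.mean, 'var': var}

    @classmethod
    def merge(cls, parts):
        # Combine accumulators from separate passes or machines (use-case 3).
        total = cls()
        for part in parts:
            if part.count == 0:
                continue
            n1, n2 = total.count, part.count
            delta = part.mean - total.mean
            total.count = n1 + n2
            total.mean += delta * n2 / total.count
            total._m2 += part._m2 + delta * delta * n1 * n2 / total.count
        return total

For example:

>>> stats = IterStats()
>>> stats.consume_data([1, 2, 3, 4])
>>> stats.compute_statistics()['var']
1.6666666666666667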
Oscar