[Python-Dev] PEP 450 adding statistics module
Steven D'Aprano
steve at pearwood.info
Sun Sep 8 21:19:54 CEST 2013
On Sun, Sep 08, 2013 at 10:25:22AM -0700, Guido van Rossum wrote:
> Steven, I'd like to just approve the PEP, given the amount of
> discussion that's happened already (though I didn't follow much of
> it). I quickly glanced through the PEP and didn't find anything I'd
> personally object to, but then I found your section of open issues,
> and I realized that you don't actually specify the proposed API in the
> PEP itself. It's highly unusual to approve a PEP that doesn't contain
> a specification. What did I miss?
You didn't miss anything, but I may have.
Should the PEP go through each public function in the module (there are
only 11)? That may be a little repetitive, since most have the same, or
almost the same, signatures. Or is it acceptable to just include an
overview? I've come up with this:
API
The initial version of the library will provide univariate (single
variable) statistics functions. The general API will be based on a
functional model ``function(data, ...) -> result``, where ``data``
is a mandatory iterable of (usually) numeric data.
The author expects that lists will be the most common data type used,
but any iterable type should be acceptable. Where necessary, functions
may convert to lists internally. Where possible, functions are
expected to preserve the type of the data values; for example, the mean
of a list of Decimals should be a Decimal rather than a float.
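As a rough illustration of the type-preservation point, the sketch below
uses the ``statistics`` module as it later shipped in the standard library
(Python 3.4 and newer). That is an assumption on top of this draft: the
shipped behaviour may differ in detail from what is described here.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean  # assumes the shipped module, Python 3.4+

# The mean of Decimals comes back as a Decimal, not a float.
decimal_mean = mean([Decimal("1.5"), Decimal("2.5"), Decimal("2.0")])

# Likewise, the mean of Fractions stays a Fraction.
fraction_mean = mean([Fraction(1, 2), Fraction(3, 2)])

# Any iterable should be acceptable, not only lists.
generator_mean = mean(x for x in (1.0, 2.0, 3.0))

print(type(decimal_mean).__name__, decimal_mean)
print(type(fraction_mean).__name__, fraction_mean)
print(type(generator_mean).__name__, generator_mean)
```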
Calculating the mean, median and mode
The ``mean``, ``median`` and ``mode`` functions take a single
mandatory argument and return the appropriate statistic, e.g.:
>>> mean([1, 2, 3])
2.0
``mode`` is the sole exception to the rule that the data argument
must be numeric. It will also accept an iterable of nominal data,
such as strings.
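A short sketch of ``mode`` on nominal data, again assuming the
``statistics`` module as it later shipped in the standard library. The
sample data is made up for illustration and has a single most common
value, so the result does not depend on how ties are handled.

```python
from statistics import mode  # assumes the shipped module, Python 3.4+

# Nominal (non-numeric) data: the most frequent string is returned.
colours = ["red", "blue", "red", "green"]
most_common_colour = mode(colours)

# Numeric data works the same way.
most_common_number = mode([1, 1, 2, 3])

print(most_common_colour, most_common_number)
```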
Calculating variance and standard deviation
In order to be similar to scientific calculators, the statistics
module will include separate functions for population and sample
variance and standard deviation. All four functions have similar
signatures, with a single mandatory argument, an iterable of
numeric data, e.g.:
>>> variance([1, 2, 2, 2, 3])
0.5
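To make the sample/population distinction concrete, here is a sketch
that calls all four functions on the same data. The names ``pvariance``,
``stdev`` and ``pstdev`` are not given in the text above; they are taken
from the ``statistics`` module as it later shipped in the standard
library, which is an assumption relative to this draft.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [1, 2, 2, 2, 3]  # mean is 2, squared deviations sum to 2

sample_var = variance(data)       # divides by n - 1 = 4
population_var = pvariance(data)  # divides by n = 5
sample_sd = stdev(data)           # square root of the sample variance
population_sd = pstdev(data)      # square root of the population variance

print(sample_var, population_var, sample_sd, population_sd)
```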
All four functions also accept a second, optional, argument, the
mean of the data. This is modelled on a similar API provided by
the GNU Scientific Library[18]. There are three use-cases for
using this argument, in no particular order:
1) The value of the mean is known *a priori*.
2) You have already calculated the mean, and wish to avoid
calculating it again.
3) You wish to (ab)use the variance functions to calculate
the second moment about some given point other than the
mean.
In each case, it is the caller's responsibility to ensure that
the given argument is meaningful.
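The use-cases above can be sketched with a small pure-Python helper.
``sample_variance`` and its ``xbar`` parameter are hypothetical names
chosen for this illustration; this is not the module's implementation,
only a minimal model of an optional "mean" argument.

```python
def sample_variance(data, xbar=None):
    """Hypothetical helper: squared deviations from xbar, divided by n - 1."""
    values = list(data)  # convert so that any iterable is accepted
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    if xbar is None:
        xbar = sum(values) / n  # compute the mean only when not supplied
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


data = [1, 2, 2, 2, 3]

computed = sample_variance(data)            # mean calculated internally
known_mean = sum(data) / len(data)
reused = sample_variance(data, known_mean)  # use-case 2: reuse the mean
about_zero = sample_variance(data, 0)       # use-case 3: moment about 0

print(computed, reused, about_zero)
```

Nothing in the helper checks that ``xbar`` really is the mean, which is
why the result is only as meaningful as the value the caller passes in.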
Is this satisfactory or do I need to go into more detail?
--
Steven