[Python-Dev] PEP 450 adding statistics module

Sun Sep 8 21:19:54 CEST 2013

On Sun, Sep 08, 2013 at 10:25:22AM -0700, Guido van Rossum wrote:
> Steven, I'd like to just approve the PEP, given the amount of
> discussion that's happened already (though I didn't follow much of
> it). I quickly glanced through the PEP and didn't find anything I'd
> personally object to, but then I found your section of open issues,
> and I realized that you don't actually specify the proposed API in the
> PEP itself. It's highly unusual to approve a PEP that doesn't contain
> a specification. What did I miss?

You didn't miss anything, but I may have.

Should the PEP go through each public function in the module (there are 
only 11)? That may be a little repetitive, since most have the same, or 
almost the same, signatures. Or is it acceptable to just include an 
overview? I've come up with this:

API

    The initial version of the library will provide univariate (single
    variable) statistics functions.  The general API will be based on a
    functional model ``function(data, ...) -> result``, where ``data``
    is a mandatory iterable of (usually) numeric data.

    The author expects that lists will be the most common data type used,
    but any iterable type should be acceptable.  Where necessary, functions
    may convert to lists internally.  Where possible, functions are
    expected to preserve the type of the data values; for example, the
    mean of a list of Decimals should be a Decimal rather than a float.
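
    As a rough sketch of the type-handling goal described above, assuming
    the function is exposed as ``statistics.mean`` (the name under which
    the module later shipped in the standard library); the sample values
    are invented for illustration:

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean  # assumption: module importable as `statistics`

# A list of Decimals should give a Decimal back, not a float.
dec_mean = mean([Decimal("1.5"), Decimal("2.5"), Decimal("3.5")])

# The same goal applies to other numeric types, such as Fraction.
frac_mean = mean([Fraction(1, 2), Fraction(3, 2)])

# Any iterable should be acceptable, including a one-shot iterator.
iter_mean = mean(iter([1.0, 2.0, 3.0]))

print(type(dec_mean).__name__, dec_mean)
print(type(frac_mean).__name__, frac_mean)
print(type(iter_mean).__name__, iter_mean)
```

    The iterator case is where a function may need to convert its input
    to a list internally, since the data cannot be traversed twice.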

    Calculating the mean, median and mode

        The ``mean``, ``median`` and ``mode`` functions take a single
        mandatory argument and return the appropriate statistic, e.g.:

        >>> mean([1, 2, 3])
        2.0

        ``mode`` is the sole exception to the rule that the data argument
        must be numeric.  It will also accept an iterable of nominal data,
        such as strings.
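
        A small illustration of the nominal-data case, again assuming
        the function is available as ``statistics.mode``.  The colour
        names are made up, and the data has a single most common value
        so that the result does not depend on how ties are handled:

```python
from statistics import mode  # assumption: module importable as `statistics`

# Nominal (non-numeric) data: mode only needs to count occurrences.
colours = ["red", "blue", "blue", "green"]
most_common_colour = mode(colours)

# Numeric data works the same way.
most_common_number = mode([1, 2, 2, 2, 3])

print(most_common_colour)
print(most_common_number)
```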

    Calculating variance and standard deviation

        In order to be similar to scientific calculators, the statistics
        module will include separate functions for population and sample
        variance and standard deviation.  All four functions have similar
        signatures, with a single mandatory argument, an iterable of
        numeric data, e.g.:

        >>> variance([1, 2, 2, 2, 3])
        0.5

        All four functions also accept a second, optional, argument, the
        mean of the data.  This is modelled on a similar API provided by
        the GNU Scientific Library[18].  There are three use-cases for
        using this argument, in no particular order:

            1)  The value of the mean is known *a priori*.

            2)  You have already calculated the mean, and wish to avoid
                calculating it again.

            3)  You wish to (ab)use the variance functions to calculate
                the second moment about some given point other than the
                mean.

        In each case, it is the caller's responsibility to ensure that
        the given argument is meaningful.
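
        The first two use-cases might look like the sketch below.  It
        assumes the four functions are named ``variance``, ``pvariance``,
        ``stdev`` and ``pstdev`` (the names used by the module as it
        later shipped), and it passes the mean positionally so that
        nothing depends on the parameter's name.  The third use-case is
        left out on purpose, since its result depends on how a given
        implementation treats a centre that is not the true mean:

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [1, 2, 2, 2, 3]

# Use-case 2: calculate the mean once and reuse it.
m = mean(data)
sample_var = variance(data, m)       # divides by n - 1
population_var = pvariance(data, m)  # divides by n
sample_sd = stdev(data, m)
population_sd = pstdev(data, m)

# Use-case 1: the mean is known a priori (here, 2 by construction).
known_mean_var = pvariance(data, 2)

print(sample_var, population_var, sample_sd, population_sd, known_mean_var)
```

        Passing the true mean must give the same answers as omitting the
        argument; the functions are not expected to check that the value
        supplied really is the mean.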

Is this satisfactory or do I need to go into more detail?

-- 
Steven