[Python-ideas] Pre-PEP: adding a statistics module to Python

Fri Aug 2 19:45:19 CEST 2013

I have raised an issue on the tracker to add a statistics module to Python's standard library:

http://bugs.python.org/issue18606

and have been asked to write a PEP. Attached is my draft PEP. Feedback is requested, thanks in advance.

-- 
Steven
-------------- next part --------------
PEP: xxx
Title: Adding A Statistics Module To The Standard Library
Version: $Revision$
Last-Modified: $Date$
Author: Steven D'Aprano <steve at pearwood.info>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 01-Aug-2013
Python-Version: 3.4
Post-History:

Abstract

    This PEP proposes the addition of a module for common statistics functions
    such as mean, median, variance and standard deviation to the Python
    standard library.

Rationale

    The proposed statistics module is motivated by the "batteries included"
    philosophy towards the Python standard library.  Statistical functions such
    as mean, standard deviation and others are obvious and useful batteries,
    familiar to any Secondary School student.  Even cheap scientific
    calculators typically include multiple statistical functions, such as:

    - mean
    - population and sample variance
    - population and sample standard deviation
    - linear regression
    - correlation coefficient

    Graphing calculators aimed at Secondary School students typically
    include all of the above, plus some or all of:

    - median
    - mode
    - functions for calculating the probability of random variables
      from the normal, t, chi-squared, and F distributions
    - inference on the mean

    and others[1].  Likewise spreadsheet applications such as Microsoft Excel,
    LibreOffice and Gnumeric include rich collections of statistical
    functions[2].

    In contrast, Python currently has no standard way to calculate even the
    simplest and most obvious statistical functions such as mean.  For those
    who need statistical functions in Python, there are two obvious solutions:

    - install numpy and/or scipy[3];

    - or use a Do It Yourself solution.

    Numpy is perhaps the most full-featured solution, but it has a few
    disadvantages:

    - It may be overkill for many purposes.  The documentation for numpy even
      warns

          "It can be hard to know what functions are available in
          numpy.  This is not a complete list, but it does cover
          most of them."[4]

      and then goes on to list over 270 functions, only a small number of
      which are related to statistics.

    - Numpy is aimed at those doing heavy numerical work, and may be
      intimidating to those who don't have a background in computational
      mathematics and computer science.  For example, numpy.mean takes four
      arguments:

        mean(a, axis=None, dtype=None, out=None)

      although fortunately for the beginner or casual numpy user, three are
      optional and numpy.mean does the right thing in simple cases:

          >>>  numpy.mean([1, 2, 3, 4])
          2.5

    - For many people, installing numpy may be difficult or impossible.  For
      example, people in corporate environments may have to go through a
      difficult, time-consuming process before being permitted to install
      third-party software.  For the casual Python user, having to learn about
      installing third-party packages in order to average a list of numbers is
      unfortunate.

    This leads to option number 2, DIY statistics functions.  At first glance,
    this appears to be an attractive option, due to the apparent simplicity of
    common statistical functions.  For example:

        def mean(data):
            return sum(data)/len(data)

        def variance(data):
            # Use the Computational Formula for Variance.
            n = len(data)
            ss = sum(x**2 for x in data) - (sum(data)**2)/n
            return ss/(n-1)

        def standard_deviation(data):
            return math.sqrt(variance(data))

    The above appears to be correct with a casual test:

        >>> data = [1, 2, 4, 5, 8]
        >>> variance(data)
        7.5

    But adding a constant to every data point should not change the variance:

        >>> data = [x+1e12 for x in data]
        >>> variance(data)
        0.0

    And variance should *never* be negative:

        >>> variance(data*100)
        -1239429440.1282566

    By contrast, the proposed reference implementation gets the exactly correct
    answer 7.5 for the first two examples, and a reasonably close answer for
    the third: 6.012. numpy does no better[5].

    Even simple statistical calculations contain traps for the unwary, starting
    with the Computational Formula itself.  Despite the name, it is numerically
    unstable and can be extremely inaccurate, as can be seen above.  It is
    completely unsuitable for computation by computer[6].  This problem plagues
    users of many programming language, not just Python[7], as coders reinvent
    the same numerically inaccurate code over and over again[8], or advise
    others to do so[9].

    It isn't just the variance and standard deviation. Even the mean is not
    quite as straight-forward as it might appear.  The above implementation
    seems to simple to have problems, but it does:

    - The built-in sum can lose accuracy when dealing with floats of wildly
      differing magnitude.  Consequently, the above naive mean fails this
      "torture test" with an error of 100%:

          assert mean([1e30, 1, 3, -1e30]) == 1

    - Using math.fsum inside mean will make it more accurate with float data,
      but it also has the side-effect of converting any arguments to float
      even when unnecessary.  E.g. we should expect the mean of a list of
      Fractions to be a Fraction, not a float.

    While the above mean implementation does not fail quite as catastrophically
    as the naive variance does, a standard library function can do much better
    than the DIY versions.

Comparison To Other Languages/Packages

    The proposed statistics library is not intended to be a competitor to such
    third-party libraries as numpy/scipy, or of proprietary full-featured
    statistics packages aimed at professional statisticians such as Minitab,
    SAS and Matlab.  It is aimed at the level of graphing and scientific
    calculators.

    Most programming languages have little or no built-in support for
    statistics functions.  Some exceptions:

    R
        R (and its proprietary cousin, S) is a programming language designed
        for statistics work. It is extremely popular with statisticians and
        is extremely feature-rich[10].

    C#

        The C# LINQ package includes extension methods to calculate the
        average of enumerables[11].

    Ruby

        Ruby does not ship with a standard statistics module, despite some
        apparent demand[12].  Statsample appears to be a feature-rich third-
        party library, aiming to compete with R[13].

    PHP

        PHP has an extremely feature-rich (although mostly undocumented) set
        of advanced statistical functions[14].

    Delphi

        Delphi includes standard statistical functions including Mean, Sum,
        Variance, TotalVariance, MomentSkewKurtosis in its Math library[15].

    GNU Scientific Library

        The GNU Scientific Library includes standard statistical functions,
        percentiles, median and others[16].  One innovation I have borrowed
        from the GSL is to allow the caller to optionally specify the pre-
        calculated mean of the sample (or an a priori known population mean)
        when calculating the variance and standard deviation[17].

Design Decisions Of The Module

    In the statistics module, I have aimed for the following design features:

    - Correctness over speed.  It is easier to speed up a correct but slow
      function than to correct a fast but buggy one.

    - The caller should, where possible, avoid having to make decisions about
      implementations.  The caller should not have to decide whether to call
      a one-pass function data in an iterator, or a two-pass function for data
      in a list. Instead, the function should do the right thing in each case.

    - Where there is the possibility of different results depending on one-pass
      or two-pass algorithms, that possibility should be documented and the
      caller can then make the appropriate choice. (In most cases, such
      differences will be tiny.)

    - Functions should, as much as possible, honour any type of numeric data.
      E.g. the mean of a list of Decimals should be a Decimal, not a float.
      When this is not possible, treat float as the "lowest common data type".

    - Although functions support data sets of floats, Decimals or Fractions,
      there is no guarantee that *mixed* data sets will be supported. (But on
      the other hand, they aren't explicitly rejected either.)

    - Plenty of documentation, aimed at readers who understand the basic
      concepts but may not know (for example) which variance they should use
      (population or sample?). Mathematicians and statisticians have a terrible
      habit of being inconsistent with both notation and terminology[18], and
      having spent many hours making sense of the contradictory/confusing
      definitions in use, it is only fair that I do my best to clarify rather
      than obfuscate the topic.

    - But avoid going into tedious[19] mathematical detail.

Specification

    As the proposed reference implementation is in pure Python,
    other Python implementations can easily make use of the module
    unchanged, or adapt it as they see fit.

Previous Discussions

    This proposal has been previously discussed here[20].

Open Issues

    My intention is to start small and grow the library, rather than try to
    include everything from the start.

    - At this stage, I am unsure of the best API for multivariate statistical
      functions such as linear regression, correlation coefficient, and
      covariance. Possible APIs include:

        * Separate arguments for x and y data:
          function([x0, x1, ...], [y0, y1, ...])

        * A single argument for (x, y) data:
          function([(x0, y0), (x1, y1), ...])

        * Selecting arbitrary columns from a 2D array:
          function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)

        * Some combination of two or more of the above.

      In the absence of a consensus of preferred API for multivariate stats,
      I will defer including such multivariate functions until Python 3.5.

    - Likewise, functions for calculating probability of random variables and
      inference testing will be deferred until 3.5.

References

    [1] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf

    [2] Gnumeric:
            https://projects.gnome.org/gnumeric/functions.shtml

        LibreOffice:
            https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
            https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five

    [3] Scipy: http://scipy-central.org/
        Numpy: http://www.numpy.org/

    [4] http://wiki.scipy.org/Numpy_Functions_by_Category

    [5] Tested with numpy 1.6.1 and Python 2.7.

    [6] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/

    [7] http://rosettacode.org/wiki/Standard_deviation

    [8] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py

    [9] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration

    [10] http://www.r-project.org/

    [11] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx

    [12] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation

    [13] http://ruby-statsample.rubyforge.org/

    [14] http://www.php.net/manual/en/ref.stats.php

    [15] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.

    [16] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html

    [17] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html

    [18] http://mathworld.wolfram.com/Skewness.html

    [19] At least, tedious to those who don't like this sort of thing.

    [20] http://mail.python.org/pipermail/python-ideas/2011-September/011524.html

Copyright

    This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: