[Python-Dev] PEP 450 adding statistics module

Mon Sep 9 03:59:48 CEST 2013

On Sun, Sep 08, 2013 at 02:41:35PM -0700, Guido van Rossum wrote:
> On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:

> > The most obvious alternative that isn't explicitly mentioned in the
> > PEP is to accept either:
> >
> > def correlation(x, y=None):
> >     if y is None:
> >         xs = []
> >         ys = []
> >         for xi, yi in x:
> >             xs.append(xi)
> >             ys.append(yi)
> >     else:
> >         xs = list(x)
> >         ys = list(y)
> >         assert len(xs) == len(ys)
> >     # In reality a helper function does the above.
> >     # Now compute stuff
> >
> > This avoids any unnecessary conversions and is as convenient as
> > possible for all users at the expense of having a slightly more
> > complicated API.

The PEP does mention that, as "some combination of the above".

The PEP also mentions that the decision of what API to use for 
multivariate stats is deferred until 3.5, so there's plenty of time for 
people to bike-shed this :-)

> I don't think this is really more convenient -- it is more to learn,
> and can cause surprises (e.g. when a user is only familiar with one
> format and then sees an example using the other format, they may be
> unable to understand the example).
> 
> The one argument I *haven't* heard yet which *might* sway me would be
> something along the line "every other statistics package that users
> might be familiar with does it this way" or "all the statistics
> textbooks do it this way". (Because, frankly, when it comes to
> statistics I'm a rank amateur and I really want Steven's new module to
> educate me as much as help me compute specific statistical functions.)

I don't think that there is one common API for multivariate stats 
packages. It partially depends on whether the package is aimed at basic 
use or advanced use. I haven't done a systematic comparison of the most 
common, but here are a few examples:

- The Casio Classpad graphing calculator has a spreadsheet-like 
interface, which I consider equivalent to func(xdata, ydata).

- The HP-48G series of calculators uses a fixed global variable holding 
a matrix, and a second global variable specifying which columns to use.

- The R "cor" (correlation coefficient) function takes either a pair of 
vectors (lists), in which case it calculates a single value, or a 
matrix, in which case it calculates the correlation matrix.

- numpy.corrcoef takes one or two array arguments, and a third argument 
specifying whether to treat rows or columns as variables, and like R 
returns either a single value or the correlation matrix.

- Minitab expects two separate vector arguments, and returns the 
correlation coefficient between them.

- If I'm reading the below page correctly, the SAS corr procedure 
takes anything up to 27 arguments.

http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/procstat_corr_sect004.htm

I don't suggest we follow that API :-)
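
To make the two simplest calling conventions above concrete, here is a 
rough sketch in pure Python. The names pearson and pearson_from_pairs 
are made up for illustration only; they are not part of the PEP or of 
any proposed API. The first takes two separate sequences, in the style 
of func(xdata, ydata); the second takes a single sequence of (x, y) 
pairs and just unzips it before delegating.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    mx = sum(xs) / n
    my = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / sqrt(sxx * syy)


def pearson_from_pairs(pairs):
    """Same calculation, but taking one sequence of (x, y) pairs (hypothetical)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least two data points")
    xs, ys = zip(*pairs)  # unzip the pairs into two tuples
    return pearson(xs, ys)
```

Either form is a thin wrapper around the other, so the choice between 
them is about what users find least surprising, not about what can be 
computed.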

Quite frankly, I consider the majority of stats APIs to be confusing 
with a steep learning curve.

-- 
Steven