[Python-Dev] PEP 450 adding statistics module
steve at pearwood.info
Mon Sep 9 03:59:48 CEST 2013
On Sun, Sep 08, 2013 at 02:41:35PM -0700, Guido van Rossum wrote:
> On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin
> <oscar.j.benjamin at gmail.com> wrote:
> > The most obvious alternative that isn't explicitly mentioned in the
> > PEP is to accept either:
> > def correlation(x, y=None):
> >     if y is None:
> >         xs = []
> >         ys = []
> >         for x, y in x:
> >             xs.append(x)
> >             ys.append(y)
> >     else:
> >         xs = list(x)
> >         ys = list(y)
> >     assert len(xs) == len(ys)
> >     # In reality a helper function does the above.
> >     # Now compute stuff
> > This avoids any unnecessary conversions and is as convenient as
> > possible for all users at the expense of having a slightly more
> > complicated API.
The PEP does mention that, as "some combination of the above".
The PEP also mentions that the decision of what API to use for
multivariate stats is deferred until 3.5, so there's plenty of time for
people to bike-shed this :-)
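For concreteness, here is a minimal runnable sketch of that "accept either" style, assuming Pearson's correlation coefficient as the statistic. The helper name `_split_pairs` and the error handling are hypothetical illustrations, not anything specified in the PEP.

```python
import math


def _split_pairs(x, y=None):
    """Hypothetical helper: return (xs, ys) from either calling convention."""
    if y is None:
        # One argument: an iterable of (x, y) pairs.
        pairs = list(x)
        xs = [p[0] for p in pairs]
        ys = [p[1] for p in pairs]
    else:
        # Two arguments: separate x data and y data.
        xs = list(x)
        ys = list(y)
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same length")
    return xs, ys


def correlation(x, y=None):
    """Pearson correlation, accepting correlation(xs, ys) or correlation(pairs)."""
    xs, ys = _split_pairs(x, y)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)
```

Both correlation([1, 2, 3], [2, 4, 6]) and correlation([(1, 2), (2, 4), (3, 6)]) go through the same code path after the helper, which is the convenience being argued for. The cost is that the meaning of the first argument depends on whether a second one is given.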
> I don't think this is really more convenient -- it is more to learn,
> and can cause surprises (e.g. when a user is only familiar with one
> format and then sees an example using the other format, they may be
> unable to understand the example).
> The one argument I *haven't* heard yet which *might* sway me would be
> something along the lines of "every other statistics package that users
> might be familiar with does it this way" or "all the statistics
> textbooks do it this way". (Because, frankly, when it comes to
> statistics I'm a rank amateur and I really want Steven's new module to
> educate me as much as help me compute specific statistical functions.)
I don't think that there is one common API for multivariate stats
packages. It partially depends on whether the package is aimed at basic
use or advanced use. I haven't done a systematic comparison of the most
common, but here are a few examples:
- The Casio Classpad graphing calculator has a spreadsheet-like
interface, which I consider equivalent to func(xdata, ydata).
- The HP-48G series of calculators uses a fixed global variable holding
a matrix, and a second global variable specifying which columns to use.
- The R "cor" (correlation coefficient) function takes either a pair of
vectors (lists), and calculates a single value, or a matrix, in which
case it calculates the correlation matrix.
- numpy.corrcoef takes one or two array arguments, and a third argument
specifying whether to treat rows or columns as variables, and like R
returns either a single value or the correlation matrix.
- Minitab expects two separate vector arguments, and returns the
correlation coefficient between them.
- If I'm reading the below page correctly, the SAS corr procedure
takes anything up to 27 arguments.
I don't suggest we follow that API :-)
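To make the R-style convention above concrete, here is a rough pure-Python sketch of a function whose return type depends on how it is called. The name `cor` is borrowed from R, but everything else here (the row-per-observation layout, the helper `_pearson`) is an assumption made for illustration, not R's actual implementation.

```python
import math


def _pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def cor(x, y=None):
    """Two vectors give a single float; one matrix gives a correlation matrix."""
    if y is not None:
        return _pearson(list(x), list(y))
    # Assumption: x is a sequence of rows, one row per observation,
    # so each column is one variable.
    columns = [list(col) for col in zip(*x)]
    return [[_pearson(a, b) for b in columns] for a in columns]
```

Calling cor(xs, ys) returns a float, while cor(rows) returns a list of lists, so callers have to know which form they used before they can interpret the result.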
Quite frankly, I consider the majority of stats APIs to be confusing
with a steep learning curve.