On Sun, Sep 08, 2013 at 02:41:35PM -0700, Guido van Rossum wrote:
On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin firstname.lastname@example.org wrote:
The most obvious alternative that isn't explicitly mentioned in the PEP is to accept either:
    def correlation(x, y=None):
        if y is None:
            xs = []
            ys = []
            for x, y in x:
                xs.append(x)
                ys.append(y)
        else:
            xs = list(x)
            ys = list(y)
        assert len(xs) == len(ys)
        # In reality a helper function does the above.
        # Now compute stuff
This avoids any unnecessary conversions and is as convenient as possible for all users at the expense of having a slightly more complicated API.
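To make the "accept either" idea concrete, here is a minimal runnable sketch. It assumes the "stuff" to compute is the Pearson correlation coefficient, and the helper name _paired is hypothetical; neither detail comes from the PEP.

```python
import math


def _paired(x, y=None):
    """Hypothetical helper: normalise both calling styles to two lists."""
    if y is None:
        # One argument: an iterable of (x, y) pairs.
        xs, ys = [], []
        for a, b in x:
            xs.append(a)
            ys.append(b)
    else:
        # Two arguments: separate x and y sequences.
        xs, ys = list(x), list(y)
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same length")
    return xs, ys


def correlation(x, y=None):
    """Pearson correlation coefficient, accepting either calling style."""
    xs, ys = _paired(x, y)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)
```

With this sketch, correlation([1, 2, 3], [2, 4, 6]) and correlation([(1, 2), (2, 4), (3, 6)]) go through the same computation, which is the convenience being argued for; the cost is that the meaning of the first argument depends on whether the second one is present.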
The PEP does mention that, as "some combination of the above".
The PEP also mentions that the decision of what API to use for multivariate stats is deferred until 3.5, so there's plenty of time for people to bike-shed this :-)
I don't think this is really more convenient -- it is more to learn, and can cause surprises (e.g. when a user is only familiar with one format and then sees an example using the other format, they may be unable to understand the example).
The one argument I *haven't* heard yet which *might* sway me would be something along the lines of "every other statistics package that users might be familiar with does it this way" or "all the statistics textbooks do it this way". (Because, frankly, when it comes to statistics I'm a rank amateur and I really want Steven's new module to educate me as much as help me compute specific statistical functions.)
I don't think that there is one common API for multivariate stats packages. It partially depends on whether the package is aimed at basic use or advanced use. I haven't done a systematic comparison of the most common ones, but here are a few examples:
- The Casio Classpad graphing calculator has a spreadsheet-like interface, which I consider equivalent to func(xdata, ydata).
- The HP-48G series of calculators uses a fixed global variable holding a matrix, and a second global variable specifying which columns to use.
- The R "cor" (correlation coefficient) function takes either a pair of vectors (lists), in which case it calculates a single value, or a matrix, in which case it calculates the correlation matrix.
- numpy.corrcoef takes one or two array arguments, and a third argument specifying whether to treat rows or columns as variables, and like R returns either a single value or the correlation matrix.
- Minitab expects two separate vector arguments, and returns the correlation coefficient between them.
- If I'm reading the below page correctly, the SAS corr procedure takes anything up to 27 arguments.
I don't suggest we follow that API :-)
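The calling conventions in the list above can be compared side by side in pure Python. This is only a rough sketch: the function names pearson, correlation_matrix and correlation_of_columns are made up for illustration and do not mirror the real signatures of R, numpy, Minitab or the HP-48G.

```python
import math


def pearson(xs, ys):
    """Two separate vectors in, one coefficient out (the func(xdata, ydata) style)."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length vectors with at least two points")
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(table):
    """One matrix in, correlation matrix out.

    Assumes each row of `table` is one observation and each column is
    one variable.
    """
    columns = [list(col) for col in zip(*table)]
    return [[pearson(a, b) for b in columns] for a in columns]


def correlation_of_columns(table, i, j):
    """One matrix plus a separate choice of which two columns to use."""
    columns = list(zip(*table))
    return pearson(columns[i], columns[j])


# Three observations of three variables.
table = [(1, 2, 9), (2, 4, 7), (3, 6, 5)]
matrix = correlation_matrix(table)
```

All three functions share the same arithmetic; they differ only in how the caller hands over the data, which is exactly the part of the API that is still undecided.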
Quite frankly, I consider the majority of stats APIs to be confusing with a steep learning curve.