Yeah, so this and Steven's review of various other APIs suggest that the field of statistics hasn't really reached the object-oriented age (or perhaps the OO view isn't suitable for the field), and people really think of their data as a matrix of some sort. We should respect that. Now, if this were NumPy, it would *still* make sense to require a single argument, to be interpreted in the usual fashion. So I'm using that as a kind of leverage to still recommend taking a list of pairs instead of a pair of lists. Also, it's quite likely that at least *some* users of the new statistics module will be more familiar with OO programming (e.g. the Python DB API, PEP 249) than they are with other statistics packages.
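
For concreteness, here's roughly what the two calling conventions look
like for a hypothetical two-variable function (the name correlation()
is made up for illustration, not necessarily what Steven's module will
provide):

    xdata = [1.0, 2.0, 3.0, 4.0]
    ydata = [2.1, 3.9, 6.2, 8.0]

    # Recommended: a single argument, a list of (x, y) pairs.
    xydata = list(zip(xdata, ydata))   # [(1.0, 2.1), (2.0, 3.9), ...]
    # result = correlation(xydata)

    # Alternative: parallel arrays passed as two arguments.
    # result = correlation(xdata, ydata)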


On Sun, Sep 8, 2013 at 7:57 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Guido van Rossum writes:
 > On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin
 > <oscar.j.benjamin@gmail.com> wrote:
 > > On 8 September 2013 18:32, Guido van Rossum <guido@python.org> wrote:
 > >> Going over the open issues:
 > >>
 > >> - Parallel arrays or arrays of tuples? I think the API should require
 > >> an array of tuples. It is trivial to zip up parallel arrays to the
 > >> required format, while if you have an array of tuples, extracting the
 > >> parallel arrays is slightly more cumbersome.
 > >>
 > >> Also, for manipulating the raw data, an array of tuples makes
 > >> it easier to do insertions or removals without worrying about
 > >> losing the correspondence between the arrays.
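
Concretely, in plain Python (the data here is made up), both
conversions and a mid-list edit look like this:

    xdata = [1, 2, 3]
    ydata = [10, 20, 30]

    # Zipping parallel arrays into the required format is trivial:
    xydata = list(zip(xdata, ydata))   # [(1, 10), (2, 20), (3, 30)]

    # Extracting the parallel arrays back out is slightly more
    # cumbersome, and yields tuples rather than lists:
    xs, ys = zip(*xydata)              # (1, 2, 3) and (10, 20, 30)

    # Insertions and removals on the list of tuples cannot lose the
    # correspondence between the variables:
    xydata.insert(1, (1.5, 15))
    del xydata[0]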

I don't necessarily find this persuasive.  When working with existing
databases, it's more common to add variables than to add observations,
and that requires attention to the correspondence in any case.
Observations typically aren't added; they're "removed" temporarily,
for statistics on subsets, by slicing.  If you use the same slice for
all variables, you're not going to make a mistake.
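
That is, with parallel arrays the subset is the same slice applied to
each variable.  A sketch with placeholder series (one value per year):

    years  = list(range(1900, 2013))
    pop_me = [i for i in range(len(years))]      # placeholder data
    pop_vt = [2 * i for i in range(len(years))]  # placeholder data

    # One slice object, reused for every variable:
    postwar = slice(years.index(1946), years.index(2012) + 1)
    years_sub  = years[postwar]
    pop_me_sub = pop_me[postwar]
    pop_vt_sub = pop_vt[postwar]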

 > Not really. The implementation may change, or its needs may not be
 > obvious to the caller. I would say the right thing to do is request
 > something easy to remember, which often means consistent. In general,
 > Python APIs definitely skew towards lists of tuples rather than
 > parallel arrays, and for good reasons -- that way you benefit most
 > from built-in operations like slices and insert/append.
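
For example, with a list of tuples a subsample is a single slice, and
appending an observation is one operation, with no way for the
variables to fall out of step (placeholder data again):

    xydata = [(x, 2 * x) for x in range(100)]  # placeholder pairs

    subsample = xydata[10:20]   # ten observations, both variables at once
    xydata.append((100, 200))   # one append adds a whole observation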

However, it's common in economic statistics to have a rectangular
array and to extract both certain rows (tuples of observations on
variables) and certain columns (variables).  For example, you might
have data on the populations of American states from 1900 to 2012 and
extract the data on the New England states from 1946 to 2012 for
analysis.
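
With a list of row tuples, the row extraction is a slice or a filter,
while a column takes a comprehension; a sketch with placeholder
values:

    # One row per year: (year, state_a, state_b); values are placeholders.
    table = [(year, year % 7, year % 11) for year in range(1900, 2013)]

    # Rows (observations): the years 1946 through 2012.
    rows = [row for row in table if 1946 <= row[0] <= 2012]

    # A column (one variable) from those rows:
    state_a = [row[1] for row in rows]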

 > The one argument I *haven't* heard yet which *might* sway me would be
 > something along the line "every other statistics package that users
 > might be familiar with does it this way" or "all the statistics
 > textbooks do it this way". (Because, frankly, when it comes to
 > statistics I'm a rank amateur and I really want Steven's new module to
 > educate me as much as help me compute specific statistical functions.)

In economic statistics, most software traditionally inputs variables
in column-major order (i.e., parallel arrays).  That said, most software
nowadays allows input as spreadsheet tables.  You pays your money and
you takes your choice.

I think the example above of state population data shows that rows and
columns are pretty symmetric here.  Many databases will have "too many"
of both, and you'll want to "slice" both to get the sample and
variables relevant to your analysis.

This is all just for consideration; I am quite familiar with economic
statistics and software, but not so much with those used in sociology,
psychology, and medical applications.  In the end, I think it's best
to leave it up to Steven's judgment as to what is convenient for him
to maintain.



--
--Guido van Rossum (python.org/~guido)