[Python-Dev] PEP 450 adding statistics module

Stephen J. Turnbull stephen at xemacs.org
Mon Sep 9 04:57:50 CEST 2013

Guido van Rossum writes:
 > On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin
 > <oscar.j.benjamin at gmail.com> wrote:
 > > On 8 September 2013 18:32, Guido van Rossum <guido at python.org> wrote:
 > >> Going over the open issues:
 > >>
 > >> - Parallel arrays or arrays of tuples? I think the API should require
 > >> an array of tuples. It is trivial to zip up parallel arrays to the
 > >> required format, while if you have an array of tuples, extracting the
 > >> parallel arrays is slightly more cumbersome.
 > >>
 > >> Also for manipulating of the raw data, an array of tuples makes
 > >> it easier to do insertions or removals without worrying about
 > >> losing the correspondence between the arrays.

I don't necessarily find this persuasive.  It's more common when
working with existing databases that you add variables than add
observations.  This is going to require attention to the
correspondence in any case.  Observations aren't added, and they're
"removed" temporarily for statistics on subsets by slicing.  If you
use the same slice for all variables, you're not going to make a
mistake.

 > Not really. The implementation may change, or its needs may not be
 > obvious to the caller. I would say the right thing to do is request
 > something easy to remember, which often means consistent. In general,
 > Python APIs definitely skew towards lists of tuples rather than
 > parallel arrays, and for good reasons -- that way you benefit most
 > from built-in operations like slices and insert/append.

However, it's common in economic statistics to have a rectangular
array, and extract both certain rows (tuples of observations on
variables) and certain columns (variables).  For example you might
have data on populations of American states from 1900 to 2012, and
extract the data on New England states from 1946 to 2012 for analysis.
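
As a rough sketch of that kind of extraction in plain Python (the
state codes, years, and values below are made-up placeholders, not
real census figures, and the layout of one row per year is just an
assumption for illustration):

```python
# Hypothetical rectangular data set: one row per year, one column per state.
# All values are placeholders, not real population figures.
years = [1940, 1950, 1960, 1970]
states = ["CT", "ME", "NY", "TX"]
table = [
    (1, 2, 3, 4),      # 1940
    (5, 6, 7, 8),      # 1950
    (9, 10, 11, 12),   # 1960
    (13, 14, 15, 16),  # 1970
]

# Pick the columns (variables) and rows (observations) of interest.
wanted_states = {"CT", "ME"}
cols = [j for j, s in enumerate(states) if s in wanted_states]
rows = [i for i, y in enumerate(years) if y >= 1946]

# Slice both dimensions at once; the result is again a list of tuples.
subset = [tuple(table[i][j] for j in cols) for i in rows]
print(subset)
```

Note that neither dimension is privileged here: the row selection
and the column selection are each one list comprehension.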

 > The one argument I *haven't* heard yet which *might* sway me would be
 > something along the line "every other statistics package that users
 > might be familiar with does it this way" or "all the statistics
 > textbooks do it this way". (Because, frankly, when it comes to
 > statistics I'm a rank amateur and I really want Steven's new module to
 > educate me as much as help me compute specific statistical functions.)

In economic statistics, most software traditionally inputs variables
in column-major order (i.e., parallel arrays).  That said, most software
nowadays allows input as spreadsheet tables.  You pays your money and
you takes your choice.
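
For what it's worth, converting between the two layouts is short
either way in Python.  A minimal sketch, with made-up values and
hypothetical variable names:

```python
# Two variables stored as parallel arrays (column-major); values are made up.
x = [1.0, 2.0, 3.0]
y = [2.5, 4.5, 6.5]

# Parallel arrays -> list of tuples (one tuple per observation).
rows = list(zip(x, y))

# List of tuples -> parallel arrays again.
x_back, y_back = map(list, zip(*rows))

# The same slice applied to every variable keeps observations aligned.
sample = slice(1, 3)
x_sub, y_sub = x[sample], y[sample]
```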

I think the example above of state population data shows that rows and
columns are pretty symmetric here.  Many databases will have "too many"
of both, and you'll want to "slice" both to get the sample and
variables relevant to your analysis.

This is all just for consideration; I am quite familiar with economic
statistics and its software, but less so with what is used in
sociology, psychology, and medical applications.  In the end, I think
it's best
to leave it up to Steven's judgment as to what is convenient for him
to maintain.
