<div dir="ltr">Yeah, so this and Steven's review of various other APIs suggest that the field of statistics hasn't really reached the object-oriented age (or perhaps the OO view isn't suitable for the field), and people really think of their data as a matrix of some sort. We should respect that. Now, if this were NumPy, it would *still* make sense to require a single argument, to be interpreted in the usual fashion. So I'm using that as a kind of leverage to still recommend taking a list of pairs instead of a pair of lists. Also, it's quite likely that at least *some* of the users of the new statistics module will be more familiar with OO programming (e.g. the Python DB API, PEP 249) than they are with other statistics packages.<br>
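<br>To make the "list of pairs" versus "pair of lists" point concrete, here is a minimal sketch using only built-ins. The sample values and the names xs, ys and pairs are made up for illustration; nothing here is the statistics module's actual API.<br>

```python
# Hypothetical sample data kept as a pair of lists (parallel arrays).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Pair of lists to list of pairs: a single zip() call.
pairs = list(zip(xs, ys))

# List of pairs back to a pair of lists: zip(*...) transposes the rows.
xs_again, ys_again = (list(column) for column in zip(*pairs))

# Dropping one observation from the list of pairs removes x and y together,
# so the correspondence between the two variables cannot be lost.
del pairs[1]
```

Going from parallel lists to pairs is one call, while going back takes the slightly less obvious zip(*pairs) idiom, which is the asymmetry behind the recommendation.<br>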
</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Sep 8, 2013 at 7:57 PM, Stephen J. Turnbull <span dir="ltr"><<a href="mailto:stephen@xemacs.org" target="_blank">stephen@xemacs.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">Guido van Rossum writes:<br>
> On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin<br>
> <<a href="mailto:oscar.j.benjamin@gmail.com">oscar.j.benjamin@gmail.com</a>> wrote:<br>
> > On 8 September 2013 18:32, Guido van Rossum <<a href="mailto:guido@python.org">guido@python.org</a>> wrote:<br>
> >> Going over the open issues:<br>
> >><br>
> >> - Parallel arrays or arrays of tuples? I think the API should require<br>
> >> an array of tuples. It is trivial to zip up parallel arrays to the<br>
> >> required format, while if you have an array of tuples, extracting the<br>
> >> parallel arrays is slightly more cumbersome.<br>
> >><br>
> >> Also, for manipulating the raw data, an array of tuples makes<br>
> >> it easier to do insertions or removals without worrying about<br>
> >> losing the correspondence between the arrays.<br>
<br>
</div>I don't necessarily find this persuasive. It's more common when<br>
working with existing databases that you add variables than add<br>
observations. This is going to require attention to the<br>
correspondence in any case. Observations aren't added, and they're<br>
"removed" temporarily for statistics on subsets by slicing. If you<br>
use the same slice for all variables, you're not going to make a<br>
mistake.<br>
<div class="im"><br>
> Not really. The implementation may change, or its needs may not be<br>
> obvious to the caller. I would say the right thing to do is request<br>
> something easy to remember, which often means consistent. In general,<br>
> Python APIs definitely skew towards lists of tuples rather than<br>
> parallel arrays, and for good reasons -- that way you benefit most<br>
> from built-in operations like slices and insert/append.<br>
<br>
</div>However, it's common in economic statistics to have a rectangular<br>
array, and extract both certain rows (tuples of observations on<br>
variables) and certain columns (variables). For example you might<br>
have data on populations of American states from 1900 to 2012, and<br>
extract the data on New England states from 1946 to 2012 for analysis.<br>
<div class="im"><br>
> The one argument I *haven't* heard yet which *might* sway me would be<br>
> something along the lines of "every other statistics package that users<br>
> might be familiar with does it this way" or "all the statistics<br>
> textbooks do it this way". (Because, frankly, when it comes to<br>
> statistics I'm a rank amateur and I really want Steven's new module to<br>
> educate me as much as help me compute specific statistical functions.)<br>
<br>
</div>In economic statistics, most software traditionally inputs variables<br>
in column-major order (i.e., parallel arrays). That said, most software<br>
nowadays allows input as spreadsheet tables. You pays your money and<br>
you takes your choice.<br>
<br>
I think the example above of state population data shows that rows and<br>
columns are pretty symmetric here. Many databases will have "too many"<br>
of both, and you'll want to "slice" both to get the sample and<br>
variables relevant to your analysis.<br>
<br>
This is all just for consideration; I am quite familiar with economic<br>
statistics and software, but not so much with that used in sociology,<br>
psychology, and medical applications. In the end, I think it's best<br>
to leave it up to Steven's judgment as to what is convenient for him<br>
to maintain.<br>
</blockquote></div><br><br clear="all"><br>-- <br>--Guido van Rossum (<a href="http://python.org/~guido">python.org/~guido</a>)
</div>