[Python-Dev] PEP 450 adding statistics module
Guido van Rossum
guido at python.org
Mon Sep 9 05:16:19 CEST 2013
Yeah, so this and Steven's review of various other APIs suggests that the
field of statistics hasn't really reached the object-oriented age (or
perhaps the OO view isn't suitable for the field), and people really think
of their data as a matrix of some sort. We should respect that. Now, if
this were NumPy, it would *still* make sense to require a single argument,
to be interpreted in the usual fashion. So I'm using that as a kind of
leverage to still recommend taking a list of pairs instead of a pair of
lists. Also, it's quite likely that at least *some* of the users of the new
statistics module will be more familiar with OO programming (e.g. the
Python DB API, PEP 249) than they are with other statistics packages.
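
A minimal sketch of the two layouts under discussion, using only built-ins. The variable names and sample values are made up for illustration, and nothing here is the API of the proposed statistics module; it only shows how cheaply one form converts to the other.

```python
# Hypothetical sample data: two parallel lists (a "pair of lists").
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

# Parallel lists -> list of pairs: zip does the work.
pairs = list(zip(xs, ys))

# List of pairs -> parallel lists: the zip(*...) idiom, slightly less obvious.
xs_back, ys_back = (list(col) for col in zip(*pairs))

# Editing a list of pairs keeps each x with its y automatically.
pairs.insert(1, (1.5, 3.0))  # add one observation
del pairs[-1]                # drop the last observation
```

The unpacking form assumes at least one pair; with an empty list, zip(*pairs) yields nothing and the two-name assignment raises ValueError.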
On Sun, Sep 8, 2013 at 7:57 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Guido van Rossum writes:
> > On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin
> > <oscar.j.benjamin at gmail.com> wrote:
> > > On 8 September 2013 18:32, Guido van Rossum <guido at python.org> wrote:
> > >> Going over the open issues:
> > >>
> > >> - Parallel arrays or arrays of tuples? I think the API should require
> > >> an array of tuples. It is trivial to zip up parallel arrays to the
> > >> required format, while if you have an array of tuples, extracting the
> > >> parallel arrays is slightly more cumbersome.
> > >>
> > >> Also, for manipulating the raw data, an array of tuples makes
> > >> it easier to do insertions or removals without worrying about
> > >> losing the correspondence between the arrays.
> I don't necessarily find this persuasive. It's more common when
> working with existing databases that you add variables than add
> observations. This is going to require attention to the
> correspondence in any case. Observations aren't added, and they're
> "removed" temporarily for statistics on subsets by slicing. If you
> use the same slice for all variables, you're not going to make a
> mistake.
> > Not really. The implementation may change, or its needs may not be
> > obvious to the caller. I would say the right thing to do is request
> > something easy to remember, which often means consistent. In general,
> > Python APIs definitely skew towards lists of tuples rather than
> > parallel arrays, and for good reasons -- that way you benefit most
> > from built-in operations like slices and insert/append.
> However, it's common in economic statistics to have a rectangular
> array, and extract both certain rows (tuples of observations on
> variables) and certain columns (variables). For example you might
> have data on populations of American states from 1900 to 2012, and
> extract the data on New England states from 1946 to 2012 for analysis.
> > The one argument I *haven't* heard yet which *might* sway me would be
> > something along the lines of "every other statistics package that users
> > might be familiar with does it this way" or "all the statistics
> > textbooks do it this way". (Because, frankly, when it comes to
> > statistics I'm a rank amateur and I really want Steven's new module to
> > educate me as much as help me compute specific statistical functions.)
> In economic statistics, most software traditionally inputs variables
> in column-major order (ie, parallel arrays). That said, most software
> nowadays allows input as spreadsheet tables. You pays your money and
> you takes your choice.
> I think the example above of state population data shows that rows and
> columns are pretty symmetric here. Many databases will have "too many"
> of both, and you'll want to "slice" both to get the sample and
> variables relevant to your analysis.
> This is all just for consideration; I am quite familiar with economic
> statistics and software, but not so much for that used in sociology,
> psychology, and medical applications. In the end, I think it's best
> to leave it up to Steven's judgment as to what is convenient for him
> to maintain.
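
The row-and-column slicing described in the quoted text (certain states, certain years, one variable) might look roughly like this with a plain list of tuples. The table layout, the names, and the figures are all placeholders invented for the sketch, not real population data.

```python
# Hypothetical table: one row per (state, year, population) observation.
# The figures are placeholders, not real population data.
rows = [
    ("Maine", 1940, 10),
    ("Maine", 1950, 11),
    ("Texas", 1950, 50),
    ("Vermont", 1950, 4),
    ("Vermont", 1960, 5),
]
new_england = {"Maine", "Vermont"}  # deliberately abbreviated list of states

# Row selection: keep observations for the chosen states and years.
sample = [r for r in rows if r[0] in new_england and r[1] >= 1946]

# Column selection: pull out one variable as its own list.
populations = [r[2] for r in sample]
```

Selecting rows is a filter and selecting a column is a projection, so neither direction is much harder than the other in this layout.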
--Guido van Rossum (python.org/~guido)