[Python-Dev] PEP 450 adding statistics module

Guido van Rossum guido at python.org
Mon Sep 9 05:16:19 CEST 2013


Yeah, so this and Steven's review of various other APIs suggest that the
field of statistics hasn't really reached the object-oriented age (or
perhaps the OO view isn't suitable for the field), and people really think
of their data as a matrix of some sort. We should respect that. Now, if
this were NumPy, it would *still* make sense to require a single argument,
to be interpreted in the usual fashion. So I'm using that as a kind of
leverage to still recommend taking a list of pairs instead of a pair of
lists. Also, it's quite likely that at least *some* of the users of the new
statistics module will be more familiar with OO programming (e.g. the
Python DB API, PEP 249) than they are with other statistics packages.
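
To make the two layouts concrete, here is a minimal sketch with made-up
data. The names xs, ys and pairs are illustrative only and are not part of
PEP 450. It shows that zipping parallel lists into a list of pairs is a
one-liner, that going the other way needs the slightly less obvious
zip(*pairs) idiom, and that deleting one observation from the list of
pairs cannot leave the x and y values out of step.

```python
# Illustrative data only; these names do not come from the PEP.
xs = [1.0, 2.0, 3.0]
ys = [2.5, 4.1, 6.2]

# Pair of lists -> list of pairs.
pairs = list(zip(xs, ys))

# List of pairs -> pair of lists (the "unzip" idiom).
xs2, ys2 = (list(column) for column in zip(*pairs))

# Removing one observation from the list of pairs drops its x and y together.
del pairs[1]
```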


On Sun, Sep 8, 2013 at 7:57 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> Guido van Rossum writes:
>  > On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin
>  > <oscar.j.benjamin at gmail.com> wrote:
>  > > On 8 September 2013 18:32, Guido van Rossum <guido at python.org> wrote:
>  > >> Going over the open issues:
>  > >>
>  > >> - Parallel arrays or arrays of tuples? I think the API should require
>  > >> an array of tuples. It is trivial to zip up parallel arrays to the
>  > >> required format, while if you have an array of tuples, extracting the
>  > >> parallel arrays is slightly more cumbersome.
>  > >>
>  > >> Also for manipulation of the raw data, an array of tuples makes
>  > >> it easier to do insertions or removals without worrying about
>  > >> losing the correspondence between the arrays.
>
> I don't necessarily find this persuasive.  It's more common when
> working with existing databases that you add variables than add
> observations.  This is going to require attention to the
> correspondence in any case.  Observations aren't added, and they're
> "removed" temporarily for statistics on subsets by slicing.  If you
> use the same slice for all variables, you're not going to make a
> mistake.
>
>  > Not really. The implementation may change, or its needs may not be
>  > obvious to the caller. I would say the right thing to do is request
>  > something easy to remember, which often means consistent. In general,
>  > Python APIs definitely skew towards lists of tuples rather than
>  > parallel arrays, and for good reasons -- that way you benefit most
>  > from built-in operations like slices and insert/append.
>
> However, it's common in economic statistics to have a rectangular
> array, and extract both certain rows (tuples of observations on
> variables) and certain columns (variables).  For example you might
> have data on populations of American states from 1900 to 2012, and
> extract the data on New England states from 1946 to 2012 for analysis.
>
>  > The one argument I *haven't* heard yet which *might* sway me would be
>  > something along the lines of "every other statistics package that users
>  > might be familiar with does it this way" or "all the statistics
>  > textbooks do it this way". (Because, frankly, when it comes to
>  > statistics I'm a rank amateur and I really want Steven's new module to
>  > educate me as much as help me compute specific statistical functions.)
>
> In economic statistics, most software traditionally inputs variables
> in column-major order (ie, parallel arrays).  That said, most software
> nowadays allows input as spreadsheet tables.  You pays your money and
> you takes your choice.
>
> I think the example above of state population data shows that rows and
> columns are pretty symmetric here.  Many databases will have "too many"
> of both, and you'll want to "slice" both to get the sample and
> variables relevant to your analysis.
>
> This is all just for consideration; I am quite familiar with economic
> statistics and software, but not so much for that used in sociology,
> psychology, and medical applications.  In the end, I think it's best
> to leave it up to Steven's judgment as to what is convenient for him
> to maintain.
>
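
For what it's worth, the row-and-column slicing described in the quoted
text can also be sketched on top of a list of tuples. This is only a rough
illustration under assumed names (states, years, table, wanted_states) and
with invented figures, not real population data: one row per year, one
column per state.

```python
# Invented figures, purely illustrative: one row per year, one column per state.
states = ["CT", "ME", "NY", "TX"]
years = [1945, 1946, 1947]
table = [
    (1.8, 0.8, 13.5, 6.8),  # 1945
    (1.9, 0.8, 13.9, 7.0),  # 1946
    (2.0, 0.9, 14.2, 7.2),  # 1947
]

# Column selection: indices of the states we want to keep.
wanted_states = {"CT", "ME"}
cols = [i for i, state in enumerate(states) if state in wanted_states]

# Row selection (years from 1946 on) and column selection in one pass.
subset = [
    tuple(row[i] for i in cols)
    for year, row in zip(years, table)
    if year >= 1946
]
```

Selecting rows is a plain filter over the list, while selecting columns
needs the extra index step, which is the asymmetry being weighed in this
thread.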



-- 
--Guido van Rossum (python.org/~guido)