Re: [Python-Dev] PEP 450 adding statistics module

8 Sep 2013

On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin
<oscar.j.benjamin@gmail.com> wrote:
...
On 8 September 2013 18:32, Guido van Rossum <guido@python.org> wrote:
...
Going over the open issues:
- Parallel arrays or arrays of tuples? I think the API should require
an array of tuples. It is trivial to zip up parallel arrays to the
required format, while if you have an array of tuples, extracting the
parallel arrays is slightly more cumbersome. Also, for manipulating
the raw data, an array of tuples makes it easier to do insertions or
removals without worrying about losing the correspondence between the
arrays.
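To illustrate the two conversions mentioned above, here is a minimal
sketch. The data and the names xs, ys and pairs are made up for the
example and do not come from the PEP.

```python
# Hypothetical sample data; the names are illustrative only.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.5]

# Parallel arrays -> list of tuples: a single zip() call.
pairs = list(zip(xs, ys))

# List of tuples -> parallel arrays: the slightly more cumbersome
# direction, since zip(*pairs) yields tuples that must be unpacked.
xs_again, ys_again = (list(column) for column in zip(*pairs))
```

Note that the unpacking in the last line raises ValueError when pairs
is empty, which is one reason this direction needs a little more care.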
For something like this, where there are multiple obvious formats for
the input data, I think it's reasonable to just request whatever is
convenient for the implementation.
Not really. The implementation may change, or its needs may not be
obvious to the caller. I would say the right thing to do is request
something easy to remember, which often means consistent. In general,
Python APIs definitely skew towards lists of tuples rather than
parallel arrays, and for good reasons -- that way you benefit most
from built-in operations like slices and insert/append.
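A small sketch of that point, using made-up data: with a list of
tuples each built-in operation touches one object, whereas parallel
lists need every operation repeated and kept in step by hand.

```python
# Hypothetical observations as (x, y) tuples.
pairs = [(1, 10), (2, 20), (4, 40)]

pairs.insert(2, (3, 30))   # one call keeps x and y together
pairs.append((5, 50))
middle = pairs[1:4]        # a slice is still a list of matched pairs

# The same edits on parallel lists must be done twice, with the
# index kept identical in both calls.
xs = [1, 2, 4]
ys = [10, 20, 40]
xs.insert(2, 3)
ys.insert(2, 30)
xs.append(5)
ys.append(50)
```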
...
Otherwise you're asking at least
some of your users to convert data from one format to another just so
that you can convert it back again. In any real problem you'll likely
have more than two variables, so you'll be writing some code to
prepare the data for the function anyway.
Yeah, so you might as well prepare it in the form that the API expects.
...
The most obvious alternative that isn't explicitly mentioned in the
PEP is to accept either:
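def correlation(x, y=None):
    if y is None:
        # Single argument: an iterable of (x, y) pairs.
        xs = []
        ys = []
        for xi, yi in x:
            xs.append(xi)
            ys.append(yi)
    else:
        # Two arguments: parallel iterables.
        xs = list(x)
        ys = list(y)
        if len(xs) != len(ys):
            raise ValueError("x and y must have the same length")
    # In reality a helper function does the above.
    # Now compute stuff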
This avoids any unnecessary conversions and is as convenient as
possible for all users at the expense of having a slightly more
complicated API.
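For concreteness, a runnable sketch of this dual-format approach. It
rests on two assumptions that are not in the original message: the
helper name _as_columns is hypothetical, and the "stuff" being
computed is taken to be Pearson's correlation coefficient. It is an
illustration only, not the PEP's implementation.

```python
import math


def _as_columns(x, y=None):
    """Hypothetical helper: normalise either input format to two lists."""
    if y is None:
        # Single argument: an iterable of (x, y) pairs.
        xs, ys = [], []
        for xi, yi in x:
            xs.append(xi)
            ys.append(yi)
    else:
        # Two arguments: parallel iterables.
        xs, ys = list(x), list(y)
        if len(xs) != len(ys):
            raise ValueError("x and y must have the same length")
    return xs, ys


def correlation(x, y=None):
    """Pearson's r (an assumed choice) for either input format."""
    xs, ys = _as_columns(x, y)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Both correlation([1, 2, 3], [2, 4, 6]) and
correlation([(1, 2), (2, 4), (3, 6)]) go through the same helper and
give the same result.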
I don't think this is really more convenient -- it is more to learn,
and can cause surprises (e.g. when a user is only familiar with one
format and then sees an example using the other format, they may be
unable to understand the example).

The one argument I *haven't* heard yet which *might* sway me would be
something along the lines of "every other statistics package that users
might be familiar with does it this way" or "all the statistics
textbooks do it this way". (Because, frankly, when it comes to
statistics I'm a rank amateur and I really want Steven's new module to
educate me as much as help me compute specific statistical functions.)

-- 
--Guido van Rossum (python.org/~guido)