[Python-Dev] PEP 450 adding statistics module

Sun Sep 8 23:41:35 CEST 2013

On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin
<oscar.j.benjamin at gmail.com> wrote:
> On 8 September 2013 18:32, Guido van Rossum <guido at python.org> wrote:
>> Going over the open issues:
>>
>> - Parallel arrays or arrays of tuples? I think the API should require
>> an array of tuples. It is trivial to zip up parallel arrays to the
>> required format, while if you have an array of tuples, extracting the
>> parallel arrays is slightly more cumbersome. Also, for manipulating
>> the raw data, an array of tuples makes it easier to do insertions or
>> removals without worrying about losing the correspondence between the
>> arrays.
>
> For something like this, where there are multiple obvious formats for
> the input data, I think it's reasonable to just request whatever is
> convenient for the implementation.

Not really. The implementation may change, or its needs may not be
obvious to the caller. I would say the right thing to do is request
something easy to remember, which often means consistent. In general,
Python APIs definitely skew towards lists of tuples rather than
parallel arrays, and for good reasons -- that way you benefit most
from built-in operations like slices and insert/append.
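
To illustrate the trade-off, here is a minimal sketch with made-up
data. The names `xs`, `ys`, and `pairs` are assumptions chosen for
illustration only; nothing here comes from the PEP itself.

```python
# Hypothetical sample data held as two parallel arrays.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

# Parallel arrays -> list of tuples: a single zip() call.
pairs = list(zip(xs, ys))

# Edits to the list of tuples keep each x together with its y,
# so the correspondence cannot be lost.
pairs.append((5.0, 9.8))   # add one observation
del pairs[0]               # drop one observation
first_two = pairs[:2]      # slicing works on whole observations

# List of tuples -> parallel arrays: slightly more cumbersome.
xs2, ys2 = (list(column) for column in zip(*pairs))
```

With parallel arrays, the same append and delete would each have to be
repeated on both lists, and forgetting one of them silently misaligns
the data.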

> Otherwise you're asking at least
> some of your users to convert data from one format to another just so
> that you can convert it back again. In any real problem you'll likely
> have more than two variables, so you'll be writing some code to
> prepare the data for the function anyway.

Yeah, so you might as well prepare it in the form that the API expects.

> The most obvious alternative that isn't explicitly mentioned in the
> PEP is to accept either:
>
> def correlation(x, y=None):
>     if y is None:
>         xs = []
>         ys = []
>         for xi, yi in x:
>             xs.append(xi)
>             ys.append(yi)
>     else:
>         xs = list(x)
>         ys = list(y)
>         assert len(xs) == len(ys)
>     # In reality a helper function does the above.
>     # Now compute stuff
>
> This avoids any unnecessary conversions and is as convenient as
> possible for all users at the expense of having a slightly more
> complicated API.

I don't think this is really more convenient -- it is more to learn,
and can cause surprises (e.g. when a user is only familiar with one
format and then sees an example using the other format, they may be
unable to understand the example).
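
For comparison, a single-format API might look roughly like the sketch
below. This is only an illustration under stated assumptions: the
one-argument signature `correlation(pairs)` is hypothetical and is not
what the PEP specifies, and the body is the textbook formula for
Pearson's r rather than any proposed implementation.

```python
import math


def correlation(pairs):
    """Pearson's r for an iterable of (x, y) tuples.

    Hypothetical signature, used here only to illustrate a
    single-format API.
    """
    data = list(pairs)
    n = len(data)
    if n < 2:
        raise ValueError("correlation requires at least two data points")
    mean_x = math.fsum(x for x, _ in data) / n
    mean_y = math.fsum(y for _, y in data) / n
    # Sums of products of deviations from the means.
    sxy = math.fsum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = math.fsum((x - mean_x) ** 2 for x, _ in data)
    syy = math.fsum((y - mean_y) ** 2 for _, y in data)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# A caller holding parallel arrays converts with one zip() call.
r = correlation(zip([1, 2, 3], [2, 4, 6]))
```

There is exactly one calling convention to learn, and every example a
user encounters looks the same.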

The one argument I *haven't* heard yet which *might* sway me would be
something along the lines of "every other statistics package that
users might be familiar with does it this way" or "all the statistics
textbooks do it this way". (Because, frankly, when it comes to
statistics I'm a rank amateur and I really want Steven's new module to
educate me as much as help me compute specific statistical functions.)

-- 
--Guido van Rossum (python.org/~guido)