[Python-Dev] PEP 450 adding statistics module

Mon Sep 9 11:00:24 CEST 2013

On 9 September 2013 04:16, Guido van Rossum <guido at python.org> wrote:
>
> Yeah, so this and Steven's review of various other APIs suggests that the
> field of statistics hasn't really reached the object-oriented age (or
> perhaps the OO view isn't suitable for the field), and people really think
> of their data as a matrix of some sort. We should respect that. Now, if this
> was NumPy, it would *still* make sense to require a single argument, to be
> interpreted in the usual fashion. So I'm using that as a kind of leverage to
> still recommend taking a list of pairs instead of a pair of lists. Also,
> it's quite likely that at least *some* of the users of the new statistics
> module will be more familiar with OO programming (e.g. the Python DB API,
> PEP 249) than they are with other statistics packages.

I'm not sure I understand what you mean by this. NumPy builds
everything on top of a core ndarray class whose methods make the
issues about multivariate stats APIs trivial. The transpose of an
array A is simply the attribute A.T, which is both convenient and cheap
since it's just an alternate view on the underlying buffer.
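
As a small illustration of that point (a sketch only, with made-up
data values): a list of pairs becomes a 2-D array, and its transpose
gives the pair-of-lists layout without copying anything.

```python
import numpy as np

# Hypothetical data: a list of (x, y) pairs.
pairs = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5)]

A = np.array(pairs)      # shape (3, 2): one row per observation
x, y = A.T               # A.T has shape (2, 3): one row per variable

# The transpose is a view on the same buffer, not a copy.
shares = np.shares_memory(A, A.T)

# Writing through the view therefore changes the original array too.
A.T[0, 0] = 10.0
```

So with an ndarray the choice between "list of pairs" and "pair of
lists" costs the caller one attribute access either way.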

NumPy also provides record arrays (structured arrays), which let you
use field names instead of numeric indices:

>>> import numpy as np
>>> dt = np.dtype([('Year', int), ('Arizona', float), ('Dakota', float)])
>>> a = np.array([(2001, 123., 456.), (2002, 234., 345), (2003, 345., 567)], dt)
>>> a
array([(2001, 123.0, 456.0), (2002, 234.0, 345.0), (2003, 345.0, 567.0)],
      dtype=[('Year', '<i4'), ('Arizona', '<f8'), ('Dakota', '<f8')])
>>> a['Year']
array([2001, 2002, 2003])
>>> a['Arizona']
array([ 123.,  234.,  345.])
>>> np.corrcoef(a['Arizona'], a['Dakota'])
array([[ 1. ,  0.5],
       [ 0.5,  1. ]])
>>> included = a[a['Year'] > 2001]
>>> included
array([(2002, 234.0, 345.0), (2003, 345.0, 567.0)],
      dtype=[('Year', '<i4'), ('Arizona', '<f8'), ('Dakota', '<f8')])
>>> np.corrcoef(included['Arizona'], included['Dakota'])
array([[ 1.,  1.],
       [ 1.,  1.]])

So perhaps the statistics module could have a similar NameTupleArray
type that can easily be loaded from and saved to a CSV file, and that
makes it easy to put your data in whatever form is required.
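
To make the idea concrete, here is a rough stdlib-only sketch of what
such a type might offer. Nothing here is a proposed API: load_records,
save_records and column are hypothetical helper names, and a plain list
of namedtuples stands in for the NameTupleArray type. The data is the
same as in the NumPy session above.

```python
import csv
import io
from collections import namedtuple


def load_records(f, converters):
    """Read CSV with a header row into a list of namedtuples.

    `converters` maps each column name to a callable such as int or float.
    (Hypothetical helper, not part of any existing module.)
    """
    reader = csv.reader(f)
    header = next(reader)
    Row = namedtuple("Row", header)
    return [Row(*(converters[name](value) for name, value in zip(header, line)))
            for line in reader]


def save_records(rows, f):
    """Write the rows back out as CSV, header first."""
    writer = csv.writer(f)
    writer.writerow(rows[0]._fields)
    writer.writerows(rows)


def column(rows, name):
    """Return one named column as a list (the 'pair of lists' form)."""
    return [getattr(row, name) for row in rows]


text = "Year,Arizona,Dakota\n2001,123,456\n2002,234,345\n2003,345,567\n"
rows = load_records(io.StringIO(text),
                    {"Year": int, "Arizona": float, "Dakota": float})

# Select rows by name, then pull out whichever columns a function needs.
included = [r for r in rows if r.Year > 2001]
arizona = column(included, "Arizona")
dakota = column(included, "Dakota")
```

Each row is still an ordinary tuple, so the same data can be passed as
a list of pairs or split into separate lists as the function requires.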

Oscar