[Numpy-discussion] Tools / data structures for statistical analysis and related applications

Wes McKinney wesmckinn at gmail.com
Fri Jun 11 11:26:26 EDT 2010


On Fri, Jun 11, 2010 at 9:46 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> On 06/09/2010 03:40 PM, Wes McKinney wrote:
>> Dear all,
>>
>> We've been having discussions on the pystatsmodels mailing list
>> recently regarding data structures and other tools for statistics /
>> other related data analysis applications.  I believe we're trying to
>> answer a number of different, but related questions:
>>
>> 1. What are the sets of functionality (and use cases) which would be
>> desirable for the scientific (or statistical) Python programmer?
>> Things like groupby
>> (http://projects.scipy.org/numpy/browser/trunk/doc/neps/groupby_additions.rst)
>> fall into this category.
>>
>> 2. Do we really need to build custom data structures (larry, pandas,
>> tabular, etc.) or are structured ndarrays enough? (My conclusion is
>> that we do need to, but others might disagree). If so, how much
>> performance are we willing to trade for functionality?
>>
>> 3. What needs to happen for Python / NumPy / SciPy to really "break
>> in" to the statistical computing field? In other words, could a
>> Python-based stack one day be a competitive alternative to R?
>>
>> These are just some ideas for collecting community input. Of course as
>> we're all working in different problem domains, the needs of users
>> will vary quite a bit across the board. We've started to collect some
>> thoughts, links, etc. on the scipy.org wiki:
>>
>> http://scipy.org/StatisticalDataStructures
>>
>> A lot of what's there already is commentary and comparison on the
>> functionality provided by pandas and la / larry (since Keith and I
>> wrote most of the stuff there). But I think we're trying to identify
>> more generally the things that are lacking in NumPy/SciPy and related
>> libraries for particular applications. At minimum it should be good
>> fodder for the SciPy conferences this year and afterward (I am
>> submitting a paper on this subject based on my experiences).
>>
>> - Wes
>
> If you need pure data storage then all you require is a timeseries,
> masked, structured ndarray. That will handle time/dates, missing values
> and named variables. This is probably the basis of all statistical
> packages, databases and spreadsheets. But the real problem is the
> blas/lapack usage that prevents anything but a standard ndarray.

For storing data sets I can agree that a structured / masked ndarray
is sufficient. But I think a lot of people are primarily concerned
about data manipulations in memory (which can currently be quite
obtuse). If you are referring to scikits.timeseries-- it expects data
to be of fixed frequency, which is too rigid an assumption for many
applications (like mine).
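
For concreteness, here's a rough sketch of that kind of storage -- a
masked structured ndarray with a date field, named variables, and one
missing value (the field names and numbers are invented purely for
illustration):

    import numpy as np
    import numpy.ma as ma

    # Named variables plus a date field; note the dates are *not*
    # required to be regularly spaced.
    dt = np.dtype([('date', 'datetime64[D]'),
                   ('price', 'f8'),
                   ('volume', 'f8')])

    data = np.array([('2010-06-01', 101.5, 1200.0),
                     ('2010-06-02', 102.0, 0.0),     # volume unknown
                     ('2010-06-04', 103.2, 1500.0)], # gap in the dates
                    dtype=dt)

    # Mark the unknown volume as missing via a per-field mask.
    mask = np.zeros(data.shape, dtype=[('date', bool),
                                       ('price', bool),
                                       ('volume', bool)])
    mask['volume'][1] = True

    arr = ma.MaskedArray(data, mask=mask)
    print(arr['price'].mean())   # uses all three observations
    print(arr['volume'].mean())  # 1350.0 -- the masked value is skipped

This stores the data fine; the pain shows up once you start slicing,
aligning, and aggregating it.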

>
> The issue that I have with all these packages like tabular, la and
> pandas that extend ndarrays is the 'upstream'/'downstream' problem of
> open source development. The real problem with these extensions of numpy
> is that while you can have whatever storage you like, you either need to
> write your own functions or preprocess the storage into an acceptable
> form. So you have to rely on those extensions being updated with
> numpy/scipy since a 'fix' upstream can cause havoc downstream. I

In theory this could be a problem but of all packages to depend on in
the Python ecosystem, NumPy seems pretty safe. How many API breakages
have there been in ndarray in the last few years? This is an inherent
risk of participating in open source. After more than 2 years of
running a NumPy-SciPy based stack in production applications I feel
pretty comfortable. And besides, we write unit tests for a reason,
right?

> subscribe to what others have said elsewhere in the open source community
> in that it is very important to get your desired features upstream to
> the original project source - preferably numpy but scipy also counts.

From my experience developing pandas it's not clear to me what I've
done that _should_ make its way "upstream" into NumPy and/or SciPy.
You could imagine some form of high-level statistical data structure
making its way into scipy.stats, but I'm not sure. If NumPy could
incorporate something like R's NA value without substantially
degrading performance, that would be a boon for handling missing data
(which MaskedArray does handle for us-- but at a non-trivial
performance cost). Data alignment routines, groupby (which is already
on the table), and NaN / missing-data-aware moving window functions
(mean, median, std, etc.) would be nice general additions as well.
Any other ideas?
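
As a rough illustration of the moving window semantics I have in mind
(a deliberately naive O(n * window) loop -- not a proposed
implementation, and the function name is made up):

    import numpy as np

    def rolling_mean_nan(x, window):
        """Moving mean that skips NaNs; partial windows at the start
        are allowed, and a window with no valid observations yields
        NaN."""
        x = np.asarray(x, dtype=float)
        out = np.full(x.shape, np.nan)
        for i in range(len(x)):
            chunk = x[max(0, i - window + 1):i + 1]
            valid = chunk[~np.isnan(chunk)]
            if valid.size:
                out[i] = valid.mean()
        return out

    rolling_mean_nan([1.0, 2.0, np.nan, 4.0], window=2)
    # -> array([1. , 1.5, 2. , 4. ])

A fast compiled version of this (and of median, std, etc.), plus the
alignment and groupby pieces, is the sort of thing that could
plausibly live in NumPy or SciPy.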

>
>
> Bruce