[Numpy-discussion] Tools / data structures for statistical analysis and related applications

Fri Jun 11 09:46:44 EDT 2010

On 06/09/2010 03:40 PM, Wes McKinney wrote:
> Dear all,
>
> We've been having discussions on the pystatsmodels mailing list
> recently regarding data structures and other tools for statistics /
> other related data analysis applications.  I believe we're trying to
> answer a number of different, but related questions:
>
> 1. What are the sets of functionality (and use cases) which would be
> desirable for the scientific (or statistical) Python programmer?
> Things like groupby
> (http://projects.scipy.org/numpy/browser/trunk/doc/neps/groupby_additions.rst)
> fall into this category.
>
> 2. Do we really need to build custom data structures (larry, pandas,
> tabular, etc.) or are structured ndarrays enough? (My conclusion is
> that we do need to, but others might disagree). If so, how much
> performance are we willing to trade for functionality?
>
> 3. What needs to happen for Python / NumPy / SciPy to really "break
> in" to the statistical computing field? In other words, could a
> Python-based stack one day be a competitive alternative to R?
>
> These are just some ideas for collecting community input. Of course as
> we're all working in different problem domains, the needs of users
> will vary quite a bit across the board. We've started to collect some
> thoughts, links, etc. on the scipy.org wiki:
>
> http://scipy.org/StatisticalDataStructures
>
> A lot of what's there already is commentary and comparison on the
> functionality provided by pandas and la / larry (since Keith and I
> wrote most of the stuff there). But I think we're trying to identify
> more generally the things that are lacking in NumPy/SciPy and related
> libraries for particular applications. At minimum it should be good
> fodder for the SciPy conferences this year and afterward (I am
> submitting a paper on this subject based on my experiences).
>
> - Wes

If you need pure data storage, then all you require is a masked, 
structured time-series ndarray. That will handle times/dates, missing 
values and named variables. This is probably the basis of all statistical 
packages, databases and spreadsheets. But the real problem is the 
BLAS/LAPACK usage, which prevents anything but a standard ndarray.
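
As a rough illustration of that kind of storage, here is a minimal sketch 
of a masked structured array. It assumes a NumPy recent enough to have the 
datetime64 dtype, and the field names (date, price, volume) and the values 
are made up for the example.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical layout: one date field plus two named float variables.
dtype = [("date", "datetime64[D]"), ("price", "f8"), ("volume", "f8")]

records = np.array(
    [
        ("2010-06-09", 10.0, 100.0),
        ("2010-06-10", 0.0, 120.0),  # price is a placeholder, masked below
        ("2010-06-11", 12.0, 90.0),
    ],
    dtype=dtype,
)

# One mask tuple per record, one flag per field; True marks a missing value.
mask = [
    (False, False, False),
    (False, True, False),
    (False, False, False),
]
data = ma.masked_array(records, mask=mask)

# Named access works per field, and reductions skip the masked entry.
mean_price = data["price"].mean()
n_prices = data["price"].count()
```

Here the mean of the price field is taken over the two observed values 
only, so dates, missing values and named variables all live in a single 
array object.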

The issue that I have with all these packages like tabular, la and 
pandas that extend ndarrays is the 'upstream'/'downstream' problem of 
open source development. The real problem with these extensions of numpy 
is that while you can have whatever storage you like, you either need to 
write your own functions or preprocess the storage into an acceptable 
form. So you have to rely on those extensions being updated along with 
numpy/scipy, since a 'fix' upstream can cause havoc downstream. I 
subscribe to what others have said elsewhere in the open source community: 
it is very important to get your desired features upstream into 
the original project source - preferably numpy, but scipy also counts.
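
To make the 'preprocess the storage' step concrete, here is a small 
sketch of what it can look like for a least-squares fit. The data are made 
up, and dropping incomplete rows is just one possible way of handling 
missing values, chosen here for brevity.

```python
import numpy as np
import numpy.ma as ma

# Made-up observations; the third one is missing in both variables.
x = ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[False, False, True, False])
y = ma.masked_array([2.0, 4.0, 0.0, 8.0], mask=[False, False, True, False])

# Keep only the rows where both variables are observed.
keep = ~(ma.getmaskarray(x) | ma.getmaskarray(y))

# Build plain float ndarrays: a design matrix with an intercept column
# and the matching response vector.
X = np.column_stack([np.ones(keep.sum()), x.data[keep]])
Y = y.data[keep]

# The LAPACK-backed solver only ever sees standard ndarrays.
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
intercept, slope = coef
```

Every package that wraps its own storage around an ndarray has to carry 
some version of this conversion, which is where the dependence on 
upstream behaviour comes from.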

Bruce