[Numpy-discussion] Tools / data structures for statistical analysis and related applications

Fri Jun 11 13:57:37 EDT 2010

On 06/11/2010 10:26 AM, Wes McKinney wrote:
> On Fri, Jun 11, 2010 at 9:46 AM, Bruce Southey<bsouthey at gmail.com>  wrote:
>    
>> On 06/09/2010 03:40 PM, Wes McKinney wrote:
>>      
>>> Dear all,
>>>
>>> We've been having discussions on the pystatsmodels mailing list
>>> recently regarding data structures and other tools for statistics /
>>> other related data analysis applications.  I believe we're trying to
>>> answer a number of different, but related questions:
>>>
>>> 1. What are the sets of functionality (and use cases) which would be
>>> desirable for the scientific (or statistical) Python programmer?
>>> Things like groupby
>>> (http://projects.scipy.org/numpy/browser/trunk/doc/neps/groupby_additions.rst)
>>> fall into this category.
>>>
>>> 2. Do we really need to build custom data structures (larry, pandas,
>>> tabular, etc.) or are structured ndarrays enough? (My conclusion is
>>> that we do need to, but others might disagree). If so, how much
>>> performance are we willing to trade for functionality?
>>>
>>> 3. What needs to happen for Python / NumPy / SciPy to really "break
>>> in" to the statistical computing field? In other words, could a
>>> Python-based stack one day be a competitive alternative to R?
>>>
>>> These are just some ideas for collecting community input. Of course as
>>> we're all working in different problem domains, the needs of users
>>> will vary quite a bit across the board. We've started to collect some
>>> thoughts, links, etc. on the scipy.org wiki:
>>>
>>> http://scipy.org/StatisticalDataStructures
>>>
>>> A lot of what's there already is commentary and comparison on the
>>> functionality provided by pandas and la / larry (since Keith and I
>>> wrote most of the stuff there). But I think we're trying to identify
>>> more generally the things that are lacking in NumPy/SciPy and related
>>> libraries for particular applications. At minimum it should be good
>>> fodder for the SciPy conferences this year and afterward (I am
>>> submitting a paper on this subject based on my experiences).
>>>
>>> - Wes
>>>
>>>        
>> If you need pure data storage then all you require is a timeseries,
>> masked structured ndarray. That will handle time/dates, missing values
>> and named variables. This is probably the basis of all statistical
>> packages, databases and spreadsheets. But the real problem is the
>> blas/lapack usage that prevents anything but a standard ndarray.
>>      
> For storing data sets I can agree that a structured / masked ndarray
> is sufficient. But I think a lot of people are primarily concerned
> about data manipulations in memory (which can be currently quite
> obtuse).
Well that is not storage :-)
Data manipulations are too case dependent and full of compromises between
flexibility, memory usage and CPU time. For example, do I create a
design matrix X so I can compute np.dot(X.T, X) or directly form the
product as I read the data? The former is a memory hog because I have a
potentially huge X array as well as the smaller product array - this
holds for any solving approach that works on X. Not to mention that the
product np.dot(X.T, X) is symmetric, which is a further saving, especially
if you can use the symmetric functions of blas/lapack.
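
To make the trade-off concrete, here is a minimal sketch of the two approaches. It is not code from any of the packages discussed; the helper name xtx_streaming and the random test data are assumptions made for illustration, and the rows could just as well come from a file reader.

```python
import numpy as np


def xtx_streaming(rows, n_cols):
    """Accumulate X'X one observation at a time without storing X."""
    xtx = np.zeros((n_cols, n_cols))
    for row in rows:
        x = np.asarray(row, dtype=float)
        # Each observation contributes the outer product x x' to X'X.
        xtx += np.outer(x, x)
    return xtx


# Made-up data standing in for a design matrix read from disk.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

full = np.dot(X.T, X)                          # needs all of X in memory
streamed = xtx_streaming(iter(X), X.shape[1])  # needs one row at a time

print(np.allclose(full, streamed))
```

The streaming version only ever holds one row plus the k-by-k product, at the cost of a Python-level loop. Accumulating over blocks of rows instead of single rows would probably be the practical middle ground, and because the result is symmetric only one triangle strictly needs to be computed.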

> If you are referring to scikits.timeseries-- it expects data
> to be fixed frequency which is a too rigid assumption for many
> applications (like mine).
>    
I am referring to any container that holds a date/time variable, such as
one built on the objects from the datetime module.
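
As a rough sketch of such a container, assuming a masked structured ndarray with an object field that holds datetime.date values (the field names and sample values are made up for illustration):

```python
import datetime as dt

import numpy as np

# Named variables via a structured dtype; "date" holds datetime.date objects.
dtype = [("date", "O"), ("price", "f8")]

records = np.ma.array(
    [
        (dt.date(2010, 6, 9), 10.5),
        (dt.date(2010, 6, 11), 11.0),  # dates need not be evenly spaced
        (dt.date(2010, 6, 18), 0.0),   # placeholder value, masked below
    ],
    dtype=dtype,
    mask=[(False, False), (False, False), (False, True)],
)

prices = records["price"]
n_observed = prices.count()  # number of unmasked prices
mean_price = prices.mean()   # masked entries are ignored
```

Nothing here requires a fixed frequency: the dates are ordinary Python objects, and the mask marks the missing price.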
>    
>> The issue that I have with all these packages like tabular, la and
>> pandas that extend ndarrays is the 'upstream'/'downstream' problem of
>> open source development. The real problem with these extensions of numpy
>> is that while you can have whatever storage you like, you either need to
>> write your own functions or preprocess the storage into an acceptable
>> form. So you have to rely on those extensions being updated with
>> numpy/scipy since a 'fix' upstream can cause havoc downstream. I
>>      
> In theory this could be a problem but of all packages to depend on in
> the Python ecosystem, NumPy seems pretty safe. How many API breakages
> have there been in ndarray in the last few years? Inherently this is a
> risk of participating in open-source. After more than 2 years of
> running a NumPy-SciPy based stack in production applications I feel
> pretty comfortable. And besides, we write unit tests for a reason,
> right?
>
>    
>> subscribe to what others have said elsewhere in the open source community
>> in that it is very important to get your desired features upstream to
>> the original project source - preferably numpy but scipy also counts.
>>      
> From my experience developing pandas it's not clear to me what I've
> done that _should_ make its way "upstream" into NumPy and / or SciPy.
> You could imagine some form of high-level statistical data structure
> making its way into scipy.stats but I'm not sure.
As I indicated above, you have to rewrite the functions to use some new 
data structure and I think that would be a negative-sum game.

> If NumPy could
> incorporate something like R's NA value without substantially
> degrading performance then that would be a boon to the issue of
> handling missing data (which MaskedArray does do for us-- but at
> non-trivial performance loss).

NumPy is not orientated to the same goals as S (or SAS or any other
stats application) so it is not a valid comparison to make. For example,
S was designed from the start to "support serious data analysis"
http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html
and "[f]rom the beginning, S was designed to provide a complete 
environment for data analysis"
http://cm.bell-labs.com/stat/doc/96.7.ps

There is also the issue of how S/R handles missing values as well.
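
For reference, on the NumPy side the two conventions mentioned in this thread look roughly like the sketch below. It uses made-up values, assumes a NumPy recent enough to provide np.nanmean, and does not attempt to reproduce R's NA semantics.

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0, 4.0])

# Convention 1: NaN as the missing-value marker.
plain_mean = data.mean()            # NaN propagates through the reduction
nan_aware_mean = np.nanmean(data)   # NaN entries are skipped

# Convention 2: MaskedArray keeps a separate boolean mask next to the data.
masked = np.ma.masked_invalid(data)
masked_mean = masked.mean()         # masked entries are skipped
n_valid = masked.count()            # number of unmasked entries
```

The NaN convention only works for floating-point data, while the mask works for any dtype but carries a second array around.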

> Data alignment routines, groupby (which
> is already on the table), and NaN / missing data sensitive moving
> window functions (mean, median, std, etc.) would be nice general
> additions as well. Any other ideas?
>    
>
At present I am waiting to see what happens with pystatsmodels, as Python
stats analysis is not as high on my list as other Python things.

Bruce
