On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman wrote:
On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney wrote:
On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman wrote:
On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney wrote:
Keith (and others),
What would you think about creating a library of mostly Cython-based "domain specific functions"? So stuff like rolling statistical moments, nan* functions like you have here, and all that-- NumPy-array only functions that don't necessarily belong in NumPy or SciPy (but could be included on down the road). You were already talking about this on the statsmodels mailing list for larry. I spent a lot of time writing a bunch of these for pandas over the last couple of years, and I would have relatively few qualms about moving these outside of pandas and introducing a dependency. You could do the same for larry-- then we'd all be relying on the same well-vetted and tested codebase.
I've started working on Cython functions for moving window statistics. I plan to make it into a package called Roly (for rolling). The signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr, window, axis=-1), etc.
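For reference, the two signatures above could be prototyped in plain NumPy roughly as follows. This is only a sketch, not Roly's actual code: the convention of padding the first window-1 outputs with NaN is an assumption, and sliding_window_view needs NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def mov_sum(arr, window, axis=-1):
    """Moving-window sum along `axis`.

    Assumption: the first window-1 outputs are NaN because no full
    window is available there. A NaN inside a window makes that
    window's sum NaN.
    """
    a = np.moveaxis(np.asarray(arr, dtype=np.float64), axis, -1)
    if not 1 <= window <= a.shape[-1]:
        raise ValueError("window must be between 1 and the axis length")
    # Shape (..., n - window + 1, window): one row per full window.
    windows = sliding_window_view(a, window, axis=-1)
    out = np.full(a.shape, np.nan)
    out[..., window - 1:] = windows.sum(axis=-1)
    return np.moveaxis(out, -1, axis)


def mov_nansum(arr, window, axis=-1):
    """Like mov_sum, but NaNs inside a window count as zero."""
    a = np.asarray(arr, dtype=np.float64)
    return mov_sum(np.where(np.isnan(a), 0.0, a), window, axis)
```

A Cython version would presumably keep a running sum instead of materializing every window; the sketch only pins down the expected output.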
I think of Nanny and Roly as two separate packages. A narrow focus is good for a new package. But maybe each package could be a subpackage in a super package?
Would the function signatures in Nanny (exact duplicates of the corresponding functions in NumPy and SciPy) work for pandas? I plan to use Nanny in larry. I'll try to get the structure of the Nanny package in place. But if it doesn't attract any interest after that, then I may fold it into larry.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Why make multiple packages? It seems like all these functions are somewhat related: practical tools for real-world data analysis (where observations are often missing). I suspect having everything under one hood would create more interest than chopping things up-- it would be very useful to folks in many different disciplines (finance, economics, statistics, etc.). In R, for example, NA-handling is just a part of everyday life. Of course in R there is a special NA value which is distinct from NaN-- many folks object to the use of NaN for missing values. The alternative is masked arrays, but in my case I wasn't willing to sacrifice so much performance for purity's sake.
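To make the two representations concrete, here is a small illustrative comparison using only functions that ship with NumPy. The sample data is made up, and nothing here measures the performance difference mentioned above.

```python
import numpy as np

# Hypothetical sample with one missing observation encoded as NaN.
data = np.array([1.0, np.nan, 3.0, 4.0])

# NaN-as-missing: a plain float array plus a NaN-aware reduction.
nan_total = np.nansum(data)

# Masked-array alternative: the invalid entry is masked explicitly.
masked = np.ma.masked_invalid(data)
masked_total = masked.sum()
n_valid = masked.count()  # number of unmasked observations
```

Both give the same total; the masked array additionally carries a boolean mask alongside the data.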
I could certainly use the nan* functions to replace code in pandas where I've handled things in a somewhat ad hoc way.
A package focused on NaN-aware functions sounds like a good idea. I think a good plan would be to start by making faster, drop-in replacements for the NaN functions that are already in numpy and scipy. That is already a lot of work. After that, one possibility is to add stuff like nancumsum, nanprod, etc. After that moving window stuff?
and maybe group functions after that? If there is a lot of repetition, you could use templating. Even simple string substitution, if it is only replacing the dtype, works pretty well. It would at least reduce some copy-paste.
Josef
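A minimal sketch of the string-substitution idea, using Python's standard string.Template to stamp out one Cython function per dtype. The function name nansum_1d_<dtype> and the loop body are hypothetical, chosen only for illustration; the generated text would still have to be written to a .pyx file (with the usual cimport of numpy as np) and compiled.

```python
from string import Template

# Cython source with a ${dtype} placeholder; only the dtype varies.
TEMPLATE = Template('''\
def nansum_1d_${dtype}(np.ndarray[np.${dtype}_t, ndim=1] a):
    cdef Py_ssize_t i
    cdef np.${dtype}_t total = 0
    for i in range(a.shape[0]):
        if a[i] == a[i]:  # NaN is the only value not equal to itself
            total += a[i]
    return total
''')


def render(dtypes):
    """Return Cython source text with one function per dtype name."""
    return "\n".join(TEMPLATE.substitute(dtype=d) for d in dtypes)


source = render(["float32", "float64"])
```

Printing source shows two nearly identical functions that differ only in the dtype, which is the copy-paste being avoided.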