Re: [Numpy-discussion] [ANN] Nanny, faster NaN functions

21 Nov 2010

On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman wrote:
...
On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney  wrote:
...
On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman  wrote:
...
On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney  wrote:
...
Keith (and others),
What would you think about creating a library of mostly Cython-based
"domain specific functions"? So stuff like rolling statistical
moments, nan* functions like you have here, and all that-- NumPy-array
only functions that don't necessarily belong in NumPy or SciPy (but
could be included on down the road). You were already talking about
this on the statsmodels mailing list for larry. I spent a lot of time
writing a bunch of these for pandas over the last couple of years, and
I would have relatively few qualms about moving these outside of
pandas and introducing a dependency. You could do the same for larry--
then we'd all be relying on the same well-vetted and tested codebase.
I've started working on Cython functions for moving window statistics. I
plan to make them into a package called Roly (for rolling). The
signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
window, axis=-1), etc.
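
For readers following along, here is a rough pure-NumPy sketch of what
functions with those two signatures might compute. It is not the Roly
implementation (which is planned in Cython). The helper _moving, the
choice to pad the first window-1 outputs with NaN, and the use of
sliding_window_view (which needs NumPy 1.20 or later) are all assumptions
made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _moving(func, arr, window, axis):
    """Apply a reduction over a trailing window along one axis.

    Hypothetical helper: outputs before the first full window are NaN.
    """
    # Work on a float view with the target axis moved to the end.
    a = np.moveaxis(np.asarray(arr, dtype=np.float64), axis, -1)
    n = a.shape[-1]
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and the axis length")
    # windows has shape (..., n - window + 1, window)
    windows = sliding_window_view(a, window, axis=-1)
    out = np.full(a.shape, np.nan)
    out[..., window - 1:] = func(windows, axis=-1)
    return np.moveaxis(out, -1, axis)


def mov_sum(arr, window, axis=-1):
    """Moving sum; any NaN inside a window makes that output NaN."""
    return _moving(np.sum, arr, window, axis)


def mov_nansum(arr, window, axis=-1):
    """Moving sum that treats NaN inside a window as zero."""
    return _moving(np.nansum, arr, window, axis)
```

With a window of 2, mov_sum([1, 2, 3, 4], 2) gives [nan, 3, 5, 7], while
mov_nansum([1, nan, 3, 4], 2) gives [nan, 1, 3, 7]. A compiled version
would presumably avoid re-summing each window from scratch.
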
I think of Nanny and Roly as two separate packages. A narrow focus is
good for a new package. But maybe each package could be a subpackage
in a super package?
Would the function signatures in Nanny (exact duplicates of the
corresponding functions in Numpy and Scipy) work for pandas? I plan to
use Nanny in larry. I'll try to get the structure of the Nanny package
in place. But if it doesn't attract any interest after that then I may
fold it into larry.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Why make multiple packages? It seems like all these functions are
somewhat related: practical tools for real-world data analysis (where
observations are often missing). I suspect having everything under one
hood would create more interest than chopping things up-- would be
very useful to folks in many different disciplines (finance,
economics, statistics, etc.). In R, for example, NA-handling is just a
part of everyday life. Of course in R there is a special NA value
which is distinct from NaN-- many folks object to the use of NaN for
missing values. The alternative is masked arrays, but in my case I
wasn't willing to sacrifice so much performance for purity's sake.
I could certainly use the nan* functions to replace code in pandas
where I've handled things in a somewhat ad hoc way.
A package focused on NaN-aware functions sounds like a good idea. I
think a good plan would be to start by making faster, drop-in
replacements for the NaN functions that are already in numpy and
scipy. That is already a lot of work. After that, one possibility is
to add stuff like nancumsum, nanprod, etc. After that moving window
stuff?
and maybe group functions after that?

If there is a lot of repetition, you could use templating. Even simple
string substitution, if it is only replacing the dtype, works pretty
well. It would at least reduce some copy-paste.

Josef
...

josef.pktd＠gmail.com