[Numpy-discussion] [ANN] Nanny, faster NaN functions

Sun Nov 21 17:09:34 EST 2010

On Sun, Nov 21, 2010 at 12:30 PM,  <josef.pktd at gmail.com> wrote:
> On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>> On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>> On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>
>>>>> Keith (and others),
>>>>>
>>>>> What would you think about creating a library of mostly Cython-based
>>>>> "domain specific functions"? So stuff like rolling statistical
>>>>> moments, nan* functions like you have here, and all that-- NumPy-array
>>>>> only functions that don't necessarily belong in NumPy or SciPy (but
>>>>> could be included on down the road). You were already talking about
>>>>> this on the statsmodels mailing list for larry. I spent a lot of time
>>>>> writing a bunch of these for pandas over the last couple of years, and
>>>>> I would have relatively few qualms about moving these outside of
>>>>> pandas and introducing a dependency. You could do the same for larry--
>>>>> then we'd all be relying on the same well-vetted and tested codebase.
>>>>
>>>> I've started working on moving window statistics cython functions. I
>>>> plan to make it into a package called Roly (for rolling). The
>>>> signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
>>>> window, axis=-1), etc.
>>>>
>>>> I think of Nanny and Roly as two separate packages. A narrow focus is
>>>> good for a new package. But maybe each package could be a subpackage
>>>> in a super package?
>>>>
>>>> Would the function signatures in Nanny (exact duplicates of the
>>>> corresponding functions in NumPy and SciPy) work for pandas? I plan to
>>>> use Nanny in larry. I'll try to get the structure of the Nanny package
>>>> in place. But if it doesn't attract any interest after that then I may
>>>> fold it into larry.
>>>>
>>>
>>> Why make multiple packages? It seems like all these functions are
>>> somewhat related: practical tools for real-world data analysis (where
>>> observations are often missing). I suspect having everything under one
>>> hood would create more interest than chopping things up-- it would be
>>> very useful to folks in many different disciplines (finance,
>>> economics, statistics, etc.). In R, for example, NA-handling is just a
>>> part of everyday life. Of course in R there is a special NA value
>>> which is distinct from NaN-- many folks object to the use of NaN for
>>> missing values. The alternative is masked arrays, but in my case I
>>> wasn't willing to sacrifice so much performance for purity's sake.
>>>
>>> I could certainly use the nan* functions to replace code in pandas
>>> where I've handled things in a somewhat ad hoc way.
>>
>> A package focused on NaN-aware functions sounds like a good idea. I
>> think a good plan would be to start by making faster, drop-in
>> replacements for the NaN functions that are already in numpy and
>> scipy. That is already a lot of work. After that, one possibility is
>> to add stuff like nancumsum, nanprod, etc. After that, moving window
>> stuff?
>
> and maybe group functions after that?

Yes, group functions are on my list.

> If there is a lot of repetition, you could use templating. Even simple
> string substitution, if it is only replacing the dtype, works pretty
> well. It would at least reduce some copy-paste.

Unit test coverage should be good enough to mess around with trying
templating. What's a good way to go? Write my own script that creates
the .pyx file and call it from the makefile? Or are there packages
for doing the templating?
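
For illustration, a minimal sketch of the "simple string substitution"
route, using only the standard library's string.Template. The kernel
body, the function name nansum_1d_<dtype>, and the helper render_pyx are
hypothetical and are not taken from Nanny; the point is only that one
template can be expanded once per dtype to produce the text of a .pyx
file.

```python
from string import Template

# Hypothetical Cython kernel; only the dtype changes between copies.
KERNEL = Template('''\
def nansum_1d_${dtype}(np.ndarray[np.${dtype}_t, ndim=1] a):
    cdef Py_ssize_t i
    cdef np.${dtype}_t asum = 0, ai
    for i in range(a.shape[0]):
        ai = a[i]
        if ai == ai:  # False only when ai is NaN
            asum += ai
    return asum
''')

HEADER = "import numpy as np\ncimport numpy as np\n\n"


def render_pyx(dtypes=("float64", "float32")):
    """Return the text of a .pyx file with one kernel per dtype."""
    kernels = [KERNEL.substitute(dtype=d) for d in dtypes]
    return HEADER + "\n".join(kernels)


if __name__ == "__main__":
    # A build step could write this text to a .pyx file before running Cython.
    print(render_pyx())
```

A script like this could be run from the makefile (or a setup script)
before Cython is invoked, so the generated file never has to be edited
by hand.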

I added nanmean (the first scipy function to enter nanny) and nanmin.
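
As a reference for what these two functions are expected to compute,
here is a slow pure-NumPy sketch of the kind that could serve as an
oracle in unit tests. The names nanmean_ref and nanmin_ref are
hypothetical, this is not the Cython code in Nanny, and returning NaN
for an all-NaN slice is an assumption of the sketch.

```python
import numpy as np


def nanmean_ref(arr, axis=None):
    """Mean over `axis`, ignoring NaNs; an all-NaN slice gives NaN."""
    arr = np.asarray(arr, dtype=np.float64)
    mask = np.isnan(arr)
    total = np.where(mask, 0.0, arr).sum(axis=axis)  # NaNs contribute 0
    count = (~mask).sum(axis=axis)                   # number of non-NaNs
    with np.errstate(invalid="ignore", divide="ignore"):
        return total / count                         # 0.0 / 0 gives NaN


def nanmin_ref(arr, axis=None):
    """Minimum over `axis`, ignoring NaNs; an all-NaN slice gives NaN."""
    arr = np.asarray(arr, dtype=np.float64)
    mask = np.isnan(arr)
    # Replace NaNs with +inf so they can never be selected as the minimum.
    result = np.where(mask, np.inf, arr).min(axis=axis)
    count = (~mask).sum(axis=axis)
    return np.where(count == 0, np.nan, result)


if __name__ == "__main__":
    a = np.array([[1.0, np.nan, 3.0],
                  [np.nan, np.nan, np.nan]])
    print(nanmean_ref(a, axis=1))
    print(nanmin_ref(a, axis=1))
```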