[Numpy-discussion] [ANN] Nanny, faster NaN functions

Sun Nov 21 18:16:21 EST 2010

On Sun, Nov 21, 2010 at 6:02 PM,  <josef.pktd at gmail.com> wrote:
> On Sun, Nov 21, 2010 at 5:09 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Sun, Nov 21, 2010 at 12:30 PM,  <josef.pktd at gmail.com> wrote:
>>> On Sun, Nov 21, 2010 at 2:48 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>> On Sun, Nov 21, 2010 at 10:25 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>> On Sat, Nov 20, 2010 at 7:24 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>>>> On Sat, Nov 20, 2010 at 3:54 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>>>>
>>>>>>> Keith (and others),
>>>>>>>
>>>>>>> What would you think about creating a library of mostly Cython-based
>>>>>>> "domain specific functions"? So stuff like rolling statistical
>>>>>>> moments, nan* functions like you have here, and all that-- NumPy-array
>>>>>>> only functions that don't necessarily belong in NumPy or SciPy (but
>>>>>>> could be included on down the road). You were already talking about
>>>>>>> this on the statsmodels mailing list for larry. I spent a lot of time
>>>>>>> writing a bunch of these for pandas over the last couple of years, and
>>>>>>> I would have relatively few qualms about moving these outside of
>>>>>>> pandas and introducing a dependency. You could do the same for larry--
>>>>>>> then we'd all be relying on the same well-vetted and tested codebase.
>>>>>>
>>>>>> I've started working on moving window statistics cython functions. I
>>>>>> plan to make it into a package called Roly (for rolling). The
>>>>>> signatures are: mov_sum(arr, window, axis=-1) and mov_nansum(arr,
>>>>>> window, axis=-1), etc.
>>>>>>
>>>>>> I think of Nanny and Roly as two separate packages. A narrow focus is
>>>>>> good for a new package. But maybe each package could be a subpackage
>>>>>> in a super package?
>>>>>>
>>>>>> Would the function signatures in Nanny (exact duplicates of the
>>>>>> corresponding functions in Numpy and Scipy) work for pandas? I plan to
>>>>>> use Nanny in larry. I'll try to get the structure of the Nanny package
>>>>>> in place. But if it doesn't attract any interest after that then I may
>>>>>> fold it into larry.
>>>>>> _______________________________________________
>>>>>> NumPy-Discussion mailing list
>>>>>> NumPy-Discussion at scipy.org
>>>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>>>>
>>>>>
>>>>> Why make multiple packages? It seems like all these functions are
>>>>> somewhat related: practical tools for real-world data analysis (where
>>>>> observations are often missing). I suspect having everything under one
>>>>> hood would create more interest than chopping things up-- would be
>>>>> very useful to folks in many different disciplines (finance,
>>>>> economics, statistics, etc.). In R, for example, NA-handling is just a
>>>>> part of everyday life. Of course in R there is a special NA value
>>>>> which is distinct from NaN-- many folks object to the use of NaN for
>>>>> missing values. The alternative is masked arrays, but in my case I
>>>>> wasn't willing to sacrifice so much performance for purity's sake.
>>>>>
>>>>> I could certainly use the nan* functions to replace code in pandas
>>>>> where I've handled things in a somewhat ad hoc way.
>>>>
>>>> A package focused on NaN-aware functions sounds like a good idea. I
>>>> think a good plan would be to start by making faster, drop-in
>>>> replacements for the NaN functions that are already in numpy and
>>>> scipy. That is already a lot of work. After that, one possibility is
>>>> to add stuff like nancumsum, nanprod, etc. After that moving window
>>>> stuff?
>>>
>>> and maybe group functions after that?
>>
>> Yes, group functions are on my list.
>>
>>> If there is a lot of repetition, you could use templating. Even simple
>>> string substitution, if it is only replacing the dtype, works pretty
>>> well. It would at least reduce some copy-paste.
>>
>> Unit test coverage should be good enough to mess around with trying
>> templating. What's a good way to go? Write my own script that creates
>> the .pyx file and call it from the make file? Or are there packages
>> for doing the templating?
>
> Depends on the scale, I tried once with simple string templates
> http://codespeak.net/pipermail/cython-dev/2009-August/006614.html
>
> here is a pastebin of another version by ....(?),
> http://pastebin.com/f1a49143d discussed on the cython-dev mailing
> list.
>
> The cython list has the discussion every once in a while but I haven't
> seen any conclusion yet. For heavier duty templating a proper
> templating package (Jinja?) might be better.
>
> I'm not an expert.
>
> Josef
>
>
>>
>> I added nanmean (the first scipy function to enter nanny) and nanmin.
>
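
On the templating question quoted above: a minimal sketch of the
simple string-substitution approach, assuming Python's stdlib
string.Template and a build step that writes the result to a .pyx
file. The function name nansum_1d_<dtype>, the template body and
render_pyx are hypothetical, made up for illustration, and are not
taken from Nanny or from the linked cython-dev posts.

```python
from string import Template

# Hypothetical Cython template: only the dtype changes between copies.
# The function name and body are illustrative, not Nanny's actual code.
PYX_TEMPLATE = Template('''\
def nansum_1d_${dtype}(np.ndarray[np.${dtype}_t, ndim=1] a):
    cdef Py_ssize_t i
    cdef np.${dtype}_t asum = 0, ai
    for i in range(a.shape[0]):
        ai = a[i]
        if ai == ai:  # NaN is the only value not equal to itself
            asum += ai
    return asum
''')

# Lines every generated .pyx file needs, emitted once at the top.
HEADER = "import numpy as np\ncimport numpy as np\n\n"


def render_pyx(dtypes):
    """Return .pyx source text with one specialised function per dtype."""
    return HEADER + "\n".join(PYX_TEMPLATE.substitute(dtype=d) for d in dtypes)


# Generate the source text; a build script would write this to a .pyx file.
source = render_pyx(["float32", "float64"])
```

A script like this could be called from the make file before Cython
runs. For anything beyond swapping the dtype, a proper templating
package is probably the better route, as Josef says.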

What would you say to a single package that contains:

- NaN-aware NumPy and SciPy functions (nanmean, nanmin, etc.)
- moving window functions (moving_{count, sum, mean, var, std, etc.})
- core subroutines for labeled data
- group-by functions
- other things to add to this list?

In other words, basic computational building blocks for making
libraries like larry, pandas, etc. and for doing time series /
statistical / other manipulations on real-world (messy) data sets. The
focus isn't so much "NaN-awareness" per se as practical "data
wrangling". I would be happy to work on such a package and to move all
the Cython code I've written into it. There's potentially a little bit
of overlap with datarray, but I think that's OK.
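
To make the moving window item concrete, here is a pure-NumPy
reference sketch using the mov_nansum(arr, window, axis=-1) signature
quoted earlier in the thread. It is not the Cython implementation, just
something a test suite could compare against. Two choices are
assumptions of mine: the first window - 1 outputs along the axis are
NaN, and a window holding only NaNs sums to 0. It also assumes a NumPy
recent enough to have np.moveaxis.

```python
import numpy as np


def mov_nansum(arr, window, axis=-1):
    """Moving-window sum that treats NaN as missing (reference sketch)."""
    arr = np.asarray(arr, dtype=np.float64)
    if window < 1 or window > arr.shape[axis]:
        raise ValueError("window must be between 1 and the axis length")
    # Put the target axis last so the slicing below is uniform.
    a = np.moveaxis(arr, axis, -1)
    # Replace NaN with 0 so missing values do not contribute to the sum.
    filled = np.where(np.isnan(a), 0.0, a)
    csum = np.cumsum(filled, axis=-1)
    # Assumption: positions without a full window are reported as NaN.
    out = np.full(a.shape, np.nan)
    out[..., window - 1] = csum[..., window - 1]
    # Each later window sum is a difference of two cumulative sums.
    out[..., window:] = csum[..., window:] - csum[..., :-window]
    return np.moveaxis(out, -1, axis)
```

For example, mov_nansum([1, nan, 2, 3], 2) gives [nan, 1, 2, 5]. The
cumulative-sum trick can lose precision on long arrays with large
values, so a Cython version would more likely keep a running sum in an
explicit loop.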