This is a new project I just released. I know it is C#, but some of the design and idioms would be nice in numpy/scipy for working with discrete event simulators, time series, and event stream processing. http://code.google.com/p/incrementalstatistics/
Hi Bradford
2008/12/19 Bradford Cross
This is a new project I just released.
I know it is C#, but some of the design and idioms would be nice in numpy/scipy for working with discrete event simulators, time series, and event stream processing.
Could you please send a slightly longer description of the idioms you describe, and how they would fit into scipy.stats, scikits.timeseries, etc.? Would you be interested in working on these enhancements? Thanks Stéfan
On Thu, Dec 18, 2008 at 8:27 PM, Bradford Cross
This is a new project I just released.
I know it is C#, but some of the design and idioms would be nice in numpy/scipy for working with discrete event simulators, time series, and event stream processing.
I think an incremental stats module would be a boon to numpy or scipy. Eric Firing has a nice module written in C with a pyrex wrapper (ringbuf) that does trailing incremental mean, median, std, min, max, and percentile. It maintains a sorted queue to do the last three efficiently, and handles NaN inputs. I would like to see this extended to include exponential or other weightings to do things like incremental trailing exponential moving averages and variances. I don't know what the licensing terms are of this module, but it might be a good starting point for an incremental numpy stats module, at least if you were thinking about supporting a finite lookback window. We have a copy of this in the py4science examples dir if you want to take a look:

    svn co https://matplotlib.svn.sourceforge.net/svnroot/matplotlib/trunk/py4science/e...
    cd trailstats/
    make
    python movavg_ringbuf.py

Other things that would be very useful are incremental covariance and regression.

JDH
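For readers unfamiliar with the idea of a trailing (finite lookback) statistic, here is a minimal pure-Python sketch. It is not the ringbuf module's implementation or API: the class name `TrailingStats`, the choice to skip NaN inputs, and the use of a Welford-style add/remove update are all assumptions made for illustration.

```python
import math
from collections import deque


class TrailingStats:
    """Trailing mean and sample std over the last `window` samples.

    Hypothetical illustration only; not the ringbuf API.
    """

    def __init__(self, window):
        self.window = window
        self.buf = deque()   # samples currently inside the lookback window
        self.mean = 0.0
        self.m2 = 0.0        # sum of squared deviations from the mean

    def push(self, x):
        if math.isnan(x):
            return  # assumption: NaN inputs are simply skipped
        if len(self.buf) == self.window:
            self._remove(self.buf.popleft())
        self.buf.append(x)
        n = len(self.buf)
        delta = x - self.mean
        self.mean += delta / n
        self.m2 += delta * (x - self.mean)

    def _remove(self, x):
        """Undo the contribution of the oldest sample (Welford update in reverse)."""
        n = len(self.buf)  # number of samples left after the pop
        if n == 0:
            self.mean = 0.0
            self.m2 = 0.0
            return
        delta = x - self.mean
        self.mean -= delta / n
        self.m2 -= delta * (x - self.mean)

    def std(self):
        n = len(self.buf)
        if n < 2:
            return float("nan")
        # Clamp tiny negative values caused by floating-point round-off.
        return math.sqrt(max(self.m2, 0.0) / (n - 1))
```

Each `push` costs O(1) regardless of the window length, which is the point of the incremental formulation; order statistics such as the median need extra structure (a sorted queue or a selection algorithm).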
On a somewhat related note, I am looking for recursive calculation of variance for complex. For complex I want var as defined by E[x^2]. Is there an incremental (recursive) implementation in the complex case?
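One possible answer, as a hedged sketch: the usual Welford recursion carries over to complex data if the squared deviation is replaced by a product with the complex conjugate. This assumes the variance wanted is E[|z - mean|^2]; if the quantity wanted is literally the second moment E[|z|^2] with no mean removal, a running mean of |z|^2 is enough. The class name is hypothetical.

```python
class ComplexRunningVariance:
    """Accumulating variance E[|z - mean|^2] for complex samples (illustrative)."""

    def __init__(self):
        self.n = 0
        self.mean = 0j
        self.m2 = 0.0  # running sum of |z - mean|^2, always real

    def push(self, z):
        self.n += 1
        delta = z - self.mean            # deviation from the old mean
        self.mean += delta / self.n
        # delta * conj(z - new_mean) equals |delta|^2 * (n - 1) / n, which is
        # real; .real only discards round-off in the imaginary part.
        self.m2 += (delta * (z - self.mean).conjugate()).real

    def variance(self):
        """Population variance; NaN before any sample has been seen."""
        return self.m2 / self.n if self.n else float("nan")
```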
John Hunter wrote:
On Thu, Dec 18, 2008 at 8:27 PM, Bradford Cross
wrote: This is a new project I just released.
I know it is C#, but some of the design and idioms would be nice in numpy/scipy for working with discrete event simulators, time series, and event stream processing.
I think an incremental stats module would be a boon to numpy or scipy. Eric Firing has a nice module written in C with a pyrex wrapper (ringbuf) that does trailing incremental mean, median, std, min, max, and percentile. It maintains a sorted queue to do the last three efficiently, and handles NaN inputs. I would like to see this extended to include exponential or other weightings to do things like incremental trailing exponential moving averages and variances. I don't know what the licensing terms are of this module, but it might
Licensing is no problem; I have never bothered with it, but I can tack on a BSD-type license if that would help. Eric
be a good starting point for an incremental numpy stats module, at least if you were thinking about supporting a finite lookback window. We have a copy of this in the py4science examples dir if you want to take a look:
    svn co https://matplotlib.svn.sourceforge.net/svnroot/matplotlib/trunk/py4science/e...
    cd trailstats/
    make
    python movavg_ringbuf.py
Other things that would be very useful are incremental covariance and regression.
JDH
On Fri, Dec 19, 2008 at 12:59 PM, Eric Firing
Licensing is no problem; I have never bothered with it, but I can tack on a BSD-type license if that would help.
Great. If you are the copyright holder, would you commit a BSD license file to the py4science trailstats dir? I just committed the small bug fix we discussed yesterday there. Thanks! JDH
On Fri, Dec 19, 2008 at 6:53 AM, John Hunter
On Thu, Dec 18, 2008 at 8:27 PM, Bradford Cross
wrote: This is a new project I just released.
I know it is C#, but some of the design and idioms would be nice in numpy/scipy for working with discrete event simulators, time series, and event stream processing.
I think an incremental stats module would be a boon to numpy or scipy. Eric Firing has a nice module written in C with a pyrex wrapper (ringbuf) that does trailing incremental mean, median, std, min, max, and percentile. It maintains a sorted queue to do the last three efficiently, and handles NaN inputs. I would like to see this extended to include exponential or other weightings to do things like incremental trailing exponential moving averages and variances. I don't know what the licensing terms are of this module, but it might be a good starting point for an incremental numpy stats module, at least if you were thinking about supporting a finite lookback window. We have a copy of this in the py4science examples dir if you want to take a look:
    svn co https://matplotlib.svn.sourceforge.net/svnroot/matplotlib/trunk/py4science/e...
    cd trailstats/
    make
    python movavg_ringbuf.py
Other things that would be very useful are incremental covariance and regression.
Some sort of Kalman filter? Chuck
I did not know about this. Very cool! I think I was asking around the
numpy/scipy lists a while back but nobody mentioned this; is it new?
A couple of questions inline below.
On Fri, Dec 19, 2008 at 2:53 PM, John Hunter
On Thu, Dec 18, 2008 at 8:27 PM, Bradford Cross
wrote: This is a new project I just released.
I know it is C#, but some of the design and idioms would be nice in numpy/scipy for working with discrete event simulators, time series, and event stream processing.
I think an incremental stats module would be a boon to numpy or scipy. Eric Firing has a nice module written in C with a pyrex wrapper (ringbuf)
Please excuse my ignorance: what is the performance overhead of calling C via the pyrex wrapper? A lot of use cases for incremental statistics are discrete event systems where the calculations will be updated millions or billions of times; this was a concern I had about doing the project in C and calling across a wrapper. Maybe it was one of those entirely speculative and unfounded concerns. :)
that does trailing incremental mean, median, std, min, max, and percentile. It maintains a sorted queue to do the last three efficiently, and handles NaN inputs.
Not sure if our results hold universally or even asymptotically, but we found that our implementation of order/rank statistics was faster when we backed it with partition selection algorithms operating on an array-based queue, as opposed to our implementation of a sorted deque backed by a circular buffer. How does it handle NaN inputs exactly? Does it just guard against them? That is the approach we took as well. We have a calculation guard that filters for both NaN and infinite values.
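To make the two ideas above concrete, here is a small sketch of a rolling median that uses a selection algorithm (quickselect) on an array-style window instead of keeping the window sorted, with a guard that drops NaN and infinite inputs. It is not the C# library's code; `quickselect` and `RollingMedian` are hypothetical names, and filtering (rather than raising) on bad inputs is an assumption.

```python
import math
import random
from collections import deque


def quickselect(values, k):
    """Return the k-th smallest element (0-based) in average O(n) time."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [v for v in items if v < pivot]
        n_pivots = sum(1 for v in items if v == pivot)
        if k < len(lows):
            items = lows                      # answer lies below the pivot
        elif k < len(lows) + n_pivots:
            return pivot                      # answer is the pivot itself
        else:
            k -= len(lows) + n_pivots         # answer lies above the pivot
            items = [v for v in items if v > pivot]


class RollingMedian:
    """Median over the last `window` finite samples (illustrative only)."""

    def __init__(self, window):
        self.buf = deque(maxlen=window)  # oldest sample drops off automatically

    def push(self, x):
        if math.isfinite(x):  # calculation guard: skip NaN and +/-inf
            self.buf.append(x)

    def median(self):
        n = len(self.buf)
        if n == 0:
            return float("nan")
        mid = n // 2
        if n % 2:
            return quickselect(self.buf, mid)
        return 0.5 * (quickselect(self.buf, mid - 1) + quickselect(self.buf, mid))
```

With numpy arrays, `np.partition` performs the same kind of selection step. Whether selection beats a sorted queue depends on window size and on how often the statistic is read relative to how often it is updated, so it is worth benchmarking both.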
I would like to see this extended to include exponential or other weightings to do things like incremental trailing exponential moving averages and variances.
This is a cool idea that I hadn't thought of. We do have exponentially weighted mean, but ideally one could supply a weighting function to any statistic. We've been moving toward a more functional, combinator-style library design lately and this is another step in that direction.
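As a rough sketch of what an exponentially weighted mean and variance look like in incremental form, the following uses the standard recursion in which each new sample gets weight `alpha` and all earlier weights are scaled by `1 - alpha`. The class name and the choice to seed the mean with the first sample are assumptions; this is not taken from either library discussed here.

```python
class ExpWeightedStats:
    """Exponentially weighted mean and variance (illustrative sketch)."""

    def __init__(self, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.mean = None   # undefined until the first sample arrives
        self.var = 0.0

    def push(self, x):
        if self.mean is None:
            self.mean = x  # assumption: seed with the first observation
            return
        diff = x - self.mean
        incr = self.alpha * diff
        self.mean += incr
        # var_new = (1 - alpha) * (var_old + alpha * diff^2)
        self.var = (1.0 - self.alpha) * (self.var + diff * incr)
```

For example, with `alpha = 0.5` and inputs 0, 2, 1 the effective weights are 0.25, 0.25, 0.5, giving a mean of 1.0 and a variance of 0.5.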
I don't know what the licensing terms are of this module, but it might be a good starting point for an incremental numpy stats module, at least if you were thinking about supporting a finite lookback window.
Yes, it sounds great! If you read the docs here: http://code.google.com/p/incrementalstatistics/ you can see that we have taken care to build the library from the beginning for static, accumulating, and rolling cases. The rolling case is what you are referring to as a finite lookback window, the accumulating case has an accumulating (ever-growing) lookback window, and the static case is the typical "compute the mean of the entire series of observations at once" case. IMO, it turns out really nice when you think this way from the beginning, because you get a lot of code reuse and nice opportunities for composition.
We have a copy of this in the py4science examples dir if you want to take a look:
    svn co https://matplotlib.svn.sourceforge.net/svnroot/matplotlib/trunk/py4science/e...
    cd trailstats/
    make
    python movavg_ringbuf.py
Other things that would be very useful are incremental covariance and regression.
Indeed. We have a bit on the dependence statistics side, but not much. Incremental dependence and regression are the two hot items on the backlog. :)
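For context, incremental covariance and simple (one-variable) least-squares regression can both be built from the same Welford-style co-moment update. The sketch below is only an illustration of that idea under the accumulating (ever-growing window) case; the class and method names are hypothetical and not part of any library mentioned in this thread.

```python
class RunningRegression:
    """Accumulating covariance and simple linear regression y ~ a + b*x."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.sxx = 0.0  # sum of (x - mean_x)^2
        self.sxy = 0.0  # sum of (x - mean_x) * (y - mean_y)

    def push(self, x, y):
        self.n += 1
        dx = x - self.mean_x                 # deviation from the old x mean
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.sxx += dx * (x - self.mean_x)   # old-mean times new-mean deviation
        self.sxy += dx * (y - self.mean_y)

    def covariance(self):
        """Sample covariance (divides by n - 1)."""
        return self.sxy / (self.n - 1) if self.n > 1 else float("nan")

    def slope(self):
        return self.sxy / self.sxx if self.sxx > 0.0 else float("nan")

    def intercept(self):
        return self.mean_y - self.slope() * self.mean_x
```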
JDH
On Friday 19 December 2008 03:27:12 Bradford Cross wrote:
This is a new project I just released.
I know it is C#, but some of the design and idioms would be nice in numpy/scipy for working with discrete event simulators, time series, and event stream processing.
Hi, do you know about the boost accumulators project? It's still in boost's sandbox, but I love its design, and it provides a large number of well-documented, mathematically sound estimators for variance, mean, etc.: http://boostsandbox.sourceforge.net/libs/accumulators/doc/html/index.html Just a heads-up, in case someone finds this useful here. (Don't know about people's fondness of boost and/or C++ here.) Greetings, Hans
On Mon, Jan 19, 2009 at 7:34 PM, Hans Meine
On Friday 19 December 2008 03:27:12 Bradford Cross wrote:
This is a new project I just released.
I know it is C#, but some of the design and idioms would be nice in numpy/scipy for working with discrete event simulators, time series, and event stream processing.
Hi, do you know about the boost accumulators project?
It's still in boost's sandbox, but I love its design, and it provides a large number of well-documented, mathematically sound estimators for variance, mean, etc.: http://boostsandbox.sourceforge.net/libs/accumulators/doc/html/index.html
Just a heads-up, in case someone finds this useful here. (Don't know about people's fondness of boost and/or C++ here.)
Not a boost/C++ fan, but I like those projects. Incremental statistics have several advantages (besides the obvious one of getting an online estimate when the data arrive sequentially): they can be much more memory friendly in a python context (for example, if you want to compute statistics for billions of samples, you could do it in mini batches, and an incremental framework can help here), and they can often converge faster than an offline version even when you have all the data. I am not yet clear how pervasive those techniques are. I have looked at several papers which prove the convergence of several well-known algorithms, and implemented some of them (in particular an online EM algorithm for online estimation of mixtures of Gaussians, with Bayesian variations for sequential model comparison), and I would have expected them to be better known. I may just not be that familiar with the concerned fields, though. cheers, David
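The mini-batch point can be illustrated with the well-known pairwise merge of (count, mean, sum of squared deviations) summaries: each batch is summarized on its own and the summaries are combined, so the full data set never has to be in memory at once. This is a sketch with hypothetical function names, not code from any project in this thread.

```python
from functools import reduce


def batch_summary(batch):
    """Return (n, mean, m2) for one batch; m2 is the sum of squared deviations."""
    n = len(batch)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(batch) / n
    m2 = sum((x - mean) ** 2 for x in batch)
    return (n, mean, m2)


def merge(a, b):
    """Combine two (n, mean, m2) summaries into the summary of their union."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


def stats_in_batches(batches):
    """Population mean and variance over an iterable of batches."""
    n, mean, m2 = reduce(merge, (batch_summary(b) for b in batches), (0, 0.0, 0.0))
    variance = m2 / n if n else float("nan")
    return mean, variance
```

Because `merge` is associative up to round-off, the same summaries can also be combined across processes or machines.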
participants (8)
Bradford Cross
Charles R Harris
David Cournapeau
Eric Firing
Hans Meine
John Hunter
Neal Becker
Stéfan van der Walt