Smoothing a discrete set of data

Sat Sep 7 17:37:34 EDT 2002

Fernando Pereira wrote:
> 
> On 9/7/02 6:02 AM, in article 3csmdso5.fsf at morpheus.demon.co.uk, "Paul
> Moore" <gustav at morpheus.demon.co.uk> wrote:
> > I have a set of data, basically a histogram. The data is pretty
> > variable, and I'd like to "smooth" it to find trends. [...] When I've
> > tried to formalise this, I've always stumbled over the fact
> > that it's a trade-off exercise (there must be some sort of "tolerance"
> > parameter in the algorithm) - on the one hand, the original data is
> > "best" (in the sense that it matches exactly) whereas on the other
> > hand, a single horizontal line at the average is "best" (fewest number
> > of steps).
> >
> > My instinct is that you're looking for something that balances number
> > of steps vs difference from the original data. But for the life of me
> > I can't see how to make this intuition precise.
> >
> > Can anyone help me? I can't believe that this problem has never come
> > up before, but I can't find any literature on it.
> You must have been using the wrong search engine <wink> This is the problem
> of *regression*, extensively studied in statistics and machine learning. The
> tradeoff you mention is discussed in the literature under headings like
> "bias-variance tradeoff", "generalization bounds", "structural risk
> minimization"... If you want a single good book that covers this and related
> topics with a minimum of prerequisites (elementary calculus and linear
> algebra, a teeny bit of probability), I recommend
> 
> <http://www-stat.stanford.edu/~tibs/ElemStatLearn/>

I agree that the problems with which Paul is wrestling are addressed by
the discipline of modern statistics. However, I suspect that the
above-mentioned book by Hastie, Tibshirani and Friedman, although truly
excellent, might be a bit too specialised for Paul's needs. Some
introductory texts which cover various regression techniques (linear,
non-linear, local, polynomial etc.) might be more appropriate.
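
To make Paul's "number of steps vs difference from the original data"
intuition a bit more concrete, here is a minimal sketch in plain Python.
It assumes that "difference" means the sum of squared errors and that
each step costs a fixed penalty; the name step_fit and the penalty
parameter are my own, purely for illustration. It is one simple way to
formalise the trade-off, not a method taken from any of the texts
mentioned above.

```python
def step_fit(data, penalty):
    """Fit a piecewise-constant (step) function to data.

    Minimises: (sum of squared errors) + penalty * (number of steps),
    using dynamic programming over all possible break points.
    """
    n = len(data)
    if n == 0:
        return []

    # Prefix sums of values and squared values, so the squared error
    # of any segment can be computed in constant time.
    s1 = [0.0]
    s2 = [0.0]
    for v in data:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(i, j):
        # Squared error of data[i:j] about its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[j] is the minimum cost of fitting data[:j];
    # prev[j] is where the last step of that best fit starts.
    best = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + penalty
            if cost < best[j]:
                best[j] = cost
                prev[j] = i

    # Walk back through prev to fill in the fitted step heights.
    fitted = [0.0] * n
    j = n
    while j > 0:
        i = prev[j]
        mean = (s1[j] - s1[i]) / (j - i)
        for k in range(i, j):
            fitted[k] = mean
        j = i
    return fitted
```

The penalty plays the role of Paul's "tolerance": with a penalty of 0
the fit reproduces the original data, and with a very large penalty it
collapses to a single horizontal line at the average. For example,
step_fit([1, 1, 1, 5, 5, 5], 0.5) gives two steps, at 1.0 and 5.0.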

BTW, Paul, you describe your data as a "histogram" but say you want to
find "trends". "Histogram" implies that the data represent a frequency
distribution, but looking for "trends" implies that they are a time
series - you need to be clear in your own mind about this, because
different smoothing and regression techniques are used for each.
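
If the data do turn out to be a time series, one of the simplest
smoothers is a centred moving average. The following is only a rough
sketch in plain Python; the function name moving_average, the default
window of 3 and the way the ends are handled (shrinking the window
rather than padding) are all my own assumptions, not something from
Paul's post.

```python
def moving_average(series, window=3):
    """Smooth a sequence with a centred moving average.

    window must be a positive odd number. Near the ends of the
    series the window is truncated to the points that exist.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        chunk = series[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A wider window gives a smoother curve at the price of blurring genuine
short-term changes, which is the same trade-off again in another guise.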

Of course, there is no need to write your own routines to implement
these statistical techniques. AFAIK, there is no native Python package
which implements a comprehensive range of regression and smoothing
techniques, but I can highly recommend the R package for statistics and
statistical graphics - this mature, free, open source package has more
statistical facilities than you are ever likely to need. It runs on all
major platforms, and some excellent introductory (free) texts are
available for it - see http://www.r-project.org

If you are using a Unix, Linux or Mac OS X system, you can even use R
from within Python, thanks to Walter Moreira's excellent RPy module -
see http://rpy.sourceforge.net

Hope this helps,

Tim C