Smoothing a discrete set of data

Fernando Pereira pereira at cis.upenn.edu
Sat Sep 7 12:06:07 EDT 2002


On 9/7/02 6:02 AM, in article 3csmdso5.fsf at morpheus.demon.co.uk, "Paul
Moore" <gustav at morpheus.demon.co.uk> wrote:
> I have a set of data, basically a histogram. The data is pretty
> variable, and I'd like to "smooth" it to find trends. [...] When I've
> tried to formalise this, I've always stumbled over the fact
> that it's a trade-off exercise (there must be some sort of "tolerance"
> parameter in the algorithm) - on the one hand, the original data is
> "best" (in the sense that it matches exactly) whereas on the other
> hand, a single horizontal line at the average is "best" (fewest number
> of steps).
> 
> My instinct is that you're looking for something that balances number
> of steps vs difference from the original data. But for the life of me
> I can't see how to make this intuition precise.
> 
> Can anyone help me? I can't believe that this problem has never come
> up before, but I can't find any literature on it.
You must have been using the wrong search engine <wink>. This is the problem
of *regression*, extensively studied in statistics and machine learning. The
tradeoff you mention is discussed in the literature under headings like
"bias-variance tradeoff", "generalization bounds", "structural risk
minimization"... If you want a single good book that covers this and related
topics with a minimum of prerequisites (elementary calculus and linear
algebra, a teeny bit of probability), I recommend

<http://www-stat.stanford.edu/~tibs/ElemStatLearn/>
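To make the tradeoff concrete, here is a minimal, purely illustrative Python
sketch (my own, not from the book) of one simple regression method, a Gaussian
kernel smoother applied to the histogram counts. The bandwidth argument plays
the role of the "tolerance" parameter you were after: near zero the output
reproduces the raw counts (low bias, high variance), and as it grows the
estimate flattens toward the overall mean (high bias, low variance). The names
kernel_smooth and the sample data are just for illustration.

    import math

    def kernel_smooth(counts, bandwidth):
        """Smooth a sequence of histogram counts with a Gaussian kernel.

        Each output point is a weighted average of all counts, with
        weights falling off with distance according to 'bandwidth'.
        """
        n = len(counts)
        smoothed = []
        for i in range(n):
            # Gaussian weights centred on bin i
            weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2)
                       for j in range(n)]
            total = sum(weights)
            smoothed.append(sum(w * c for w, c in zip(weights, counts)) / total)
        return smoothed

    if __name__ == "__main__":
        data = [3, 5, 4, 9, 11, 10, 14, 9, 6, 7, 3, 2]
        for h in (0.5, 2.0, 10.0):
            print(h, [round(x, 1) for x in kernel_smooth(data, h)])

Running it with increasing bandwidths shows the two extremes you described:
the smallest bandwidth tracks the data almost exactly, while the largest is
close to a horizontal line at the average. Choosing the bandwidth in a
principled way (e.g. by cross-validation) is exactly the bias-variance
question discussed in the book above.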

-- F



