Smoothing a discrete set of data

Sat Sep 7 09:44:51 EDT 2002

"Paul Moore" <gustav at morpheus.demon.co.uk> wrote in message
news:3csmdso5.fsf at morpheus.demon.co.uk...
> This isn't so much a Python question, as an algorithms question. I
> couldn't find a newsgroup for discussing algorithms (is there a good
> one?) so as I'm implementing the code in Python I thought I'd try here
> - I hope that's OK.
>
> I have a set of data, basically a histogram. The data is pretty
> variable, and I'd like to "smooth" it to find trends. Actually, this
> comes up a *lot* for me - some examples: sampled IO rates on a machine
> - I want to look for trends (is IO higher overnight or during the day,
> etc) or fuel consumption for my car (do I use less fuel since I had
> the service).
>
> Normally, what I do with things like this is draw a graph, and try to
> spot the trends "by eyeball". But it feels to me like I should be able
> to write something which smooths the data out. I just don't see how
> :-)
>
> Take an example - here's a rough ASCII-art diagram
>
>   9     XX                                  XX
>   8     XX XX                               XX
>   7  XX XX XX                               XX XX
>   6  XX XX XX                            XX XX XX
>   5  XX XX XX                            XX XX XX
>   4  XX XX XX                            XX XX XX
>   3  XX XX XX XX          XX          XX XX XX XX
>   2  XX XX XX XX XX    XX XX XX    XX XX XX XX XX
>   1  XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX
>
> This seems to me to have very clear peaks at the left and right, and a
> shallow point in the middle. I'd say that you could "smooth" this to
> something that averaged 8 for the first 3 bars, then 2 for the next 9,
> then 7 for the last 3. But try to make that precise...
>

You could change the bin sizes, but I'd generally keep them equal.

> When I've tried to formalise this, I've always stumbled over the fact
> that it's a trade-off exercise (there most be some sort of "tolerance"
> parameter in the algorithm) - on the one hand, the original data is
> "best" (in the sense that it matches exactly) whereas on the other
> hand, a single horizontal line at the average is "best" (fewest number
> of steps).
>
> My instinct is that you're looking for something that balances number
> of steps vs difference from the original data. But for the life of me
> I can't see how to make this intuition precise.
>
> Can anyone help me? I can't believe that this problem has never come
> up before, but I can't find any literature on it.
>
> Thanks in advance,
> Paul Moore.

You could use something like a moving average, but I'm not sure that will
help you answer any specific questions.  It sounds like you might be
interested in comparing means (for fuel consumption), in which case you'd do
something like a (paired?) t-test (and you wouldn't generally throw away
information by smoothing the data).

IO rates and fuel consumption would generally be approached differently
because the data generating processes have different characteristics.  (I
wouldn't try to answer them both using a single method.)  This really sounds
like a job that needs a statistical approach.

Duncan