Smoothing a discrete set of data
Olsen, Peter C, SOFED
pcolsen at att.com
Thu Sep 12 19:08:21 CEST 2002
This is a followup to Paul Moore's post of 07 September 2002.
Paul wrote:
> I have a set of data, basically a histogram. The data is pretty
> variable, and I'd like to "smooth" it to find trends. Actually,
> this comes up a *lot* for me - some examples: sampled IO rates on
> a machine - I want to look for trends (is IO higher overnight or
> during the day, etc) or fuel consumption for my car (do I use
> less fuel since I had the service).
>
> Normally, what I do with things like this is draw a graph, and
> try to spot the trends "by eyeball". But it feels to me like I
> should be able to write something which smooths the data out. I
> just don't see how :-)
>
> Take an example - here's a rough ASCII-art diagram
>
> 9 XX XX
> 8 XX XX XX
> 7 XX XX XX XX XX
> 6 XX XX XX XX XX XX
> 5 XX XX XX XX XX XX
> 4 XX XX XX XX XX XX
> 3 XX XX XX XX XX XX XX XX XX
> 2 XX XX XX XX XX XX XX XX XX XX XX XX XX
> 1 XX XX XX XX XX XX XX XX XX XX XX XX XX XX XX
>
> This seems to me to have very clear peaks at the left and right,
> and a shallow point in the middle. I'd say that you could
> "smooth" this to something that averaged 8 for the first 3 bars,
> then 2 for the next 9, then 7 for the last 3. But try to make
> that precise...
>
>
You have hit on two of the fundamental problems of data analysis:
1. How to separate data from noise, and
2. What is the "best" way to do it.
First, let's simplify the problem and make it precise. What I think you
want to do is find the "large trends" in the data and get rid of the
noise.
You give sampled IO rates as an example. Network traffic is a similar
one. Network traffic seems to have a regular daily pattern. It's low
in the early morning, rises until before noon (perhaps about 11:00),
drops for lunch, rises again in the afternoon, and then falls at the end
of the workday.
That's the overall pattern (which is what I think you call a trend), but
the real data is never as simple as that.
If we look at short intervals, say 10 minutes, data is likely to be very
ragged, with lots of hills and valleys. What I think you'd like to do
is smooth out these hills and valleys.
To start to do that, let's build a mental model. Let's think of the
data as the sum of an underlying curve and a sequence of random errors.
Let's further assume that there is no relationship between successive
errors. (This is reasonable because we want to put all the relationship
between successive samples into the underlying curve.)
Here's how you could make such a random sequence. For every data point,
roll a die and flip a coin. If the coin is heads, add the number on the
die to the value of the data point; if it's tails, subtract instead. If
you take a particularly easy set of data points (setting every one to
the constant zero) and plot these errors, you'll see that they are
randomly distributed between -6 and 6 (excluding the case of 0 error
because the die has no zero face).
This is a good example because the actual underlying curve is simple: a
straight line at zero.
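As a quick sketch of that die-and-coin error model (using only Python's standard library; the function name `noisy` is just my label for it):

```python
import random

def noisy(data):
    """Add a die-roll error with a coin-flip sign to each data point."""
    out = []
    for value in data:
        die = random.randint(1, 6)      # roll a die
        sign = random.choice([1, -1])   # flip a coin: heads add, tails subtract
        out.append(value + sign * die)
    return out

# The "particularly easy" data set: every point is the constant zero.
flat = [0] * 10
print(noisy(flat))  # errors scattered over {-6,...,-1, 1,...,6}
```

Plot enough of these and you'll see them spread evenly between -6 and 6, never hitting 0.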
There are two simple ways to eliminate the random errors (and many
complex ones I won't mention).
One is a "moving average," the other is a "running median." Each has
its own advantages. Both are simple to implement.
For a moving average, take any odd number of successive data points,
take their arithmetic average (the sum of the values divided by the
number of points), and replace the middle point with the average.
For example, consider the sequence:
s = [5, 5, 6, 3, 5, 5, 2, 5, 6, 6, 4, 1, 1, 3, 6, 1, 2, 2, 1, 1]
These are random integers drawn with a uniform distribution from the set
{1,2,3,4,5,6} (die rolls, in effect), so the underlying curve should be
the constant 3.5.
If we take moving averages of length three, we get the sequence
[5.33, 4.67, 4.67, 4.33, 4.0, 4.0, 4.33, 5.67, 5.33, 3.67, 2.0, 1.67,
3.33, 3.33, 3.0, 1.67, 1.67, 1.33]
Notice that this sequence is two elements shorter than the original one.
This is because there is no way to calculate the moving average for the
endpoints. There are no points "beyond the end" to complete the
average.
Here's a Python function which computes the moving average of three:
def ma3(s):
    # average each window of three consecutive points
    return [(x + y + z) / 3.0 for x, y, z in zip(s, s[1:], s[2:])]
Applying the function repeatedly, five times in all
("ma3(ma3(ma3(ma3(ma3(seq)))))"), yields
[4.34, 4.42, 4.56, 4.56, 4.25, 3.68, 3.13, 2.81, 2.74, 2.69]
If you apply the function again several times, you'll see more changes,
but the successive changes become smaller and smaller. Because you are
shortening the list by two every time, eventually, you'll run out of
data.
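To make that concrete, here is a self-contained sketch of the five-fold smoothing (restating ma3 so the snippet runs on its own):

```python
def ma3(s):
    # moving average of three; the result is two elements shorter
    return [(x + y + z) / 3.0 for x, y, z in zip(s, s[1:], s[2:])]

seq = [5, 5, 6, 3, 5, 5, 2, 5, 6, 6, 4, 1, 1, 3, 6, 1, 2, 2, 1, 1]
smoothed = seq
for _ in range(5):          # each pass shortens the list by two
    smoothed = ma3(smoothed)

# 20 points -> 18 -> 16 -> 14 -> 12 -> 10
print([round(v, 2) for v in smoothed])
# -> [4.34, 4.42, 4.56, 4.56, 4.25, 3.68, 3.13, 2.81, 2.74, 2.69]
```

Five passes of a window of three cost ten data points; that's the price of the extra smoothness.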
The other basic smoother is to consider any odd number of successive
data points and replace the center one by the median.
The median of a sequence of numbers is the number in the middle of them
when they are sorted. For example, the median of the sequence
(2,5,5,10997,1,3,0) is 3. If we sort the sequence we get
(0,1,2,3,5,5,10997); there are as many data points above 3 as there are
below it.
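That definition is a one-liner in Python (for odd-length input; the standard library's statistics.median does the same job and also handles even-length input by averaging the two middle values):

```python
def median(values):
    # sort, then take the middle element (odd-length input assumed)
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

print(median([2, 5, 5, 10997, 1, 3, 0]))  # -> 3
```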
Here is some very inelegant code to calculate running medians of three:
def rm3(s):
    temp = map(lambda x, y, z: [x, y, z], s[:-2], s[1:-1], s[2:])
    temp2 = []
    for x in temp:
        x.sort()
        temp2.append(x[1])
    return temp2
(I'm sure that this can be rewritten in a functional form.)
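One possible functional rewrite, as a sketch (zip pairs each point with its two successors, which does the same job as the three slices above):

```python
def rm3(s):
    # running median of three, functional style
    return [sorted(window)[1] for window in zip(s, s[1:], s[2:])]

seq = [5, 5, 6, 3, 5, 5, 2, 5, 6, 6, 4, 1, 1, 3, 6, 1, 2, 2, 1, 1]
print(rm3(seq))
# -> [5, 5, 5, 5, 5, 5, 5, 6, 6, 4, 1, 1, 3, 3, 2, 2, 2, 1]
```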
Using this function to apply running medians of three to the initial
sequence yields
[5, 5, 5, 5, 5, 5, 5, 6, 6, 4, 1, 1, 3, 3, 2, 2, 2, 1].
You should notice one big difference right away: there are no fractions.
Each number in the sequence is always replaced by a number from the
sequence (maybe itself). This gets rid of statements like "the average
family has 1.8 children." This sometimes gives running medians the edge
when dealing with discrete data such as the number of I/O connections
per minute.
Notice too that the running median reduces the influence of really
wild numbers on the calculation. The median of (2,5,5,10997,1,3,0) is 3,
but the mean (arithmetic average) is 1573.29. This is why it only
takes one or two very high scores to "raise the average" and "bust the
curve" in an academic examination.
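A quick check of that arithmetic:

```python
vals = [2, 5, 5, 10997, 1, 3, 0]
mean = sum(vals) / len(vals)        # the one wild value dominates the mean
med = sorted(vals)[len(vals) // 2]  # the median ignores its magnitude
print(round(mean, 2), med)  # -> 1573.29 3
```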
Deciding whether it is better to use moving averages or to use running
medians is not always easy. In complex cases it requires deep
understanding of where the data came from, how it was measured, and how
you plan to use the numbers you get. It's too complex to try to discuss
here.
So, I hope this will help answer your question. If not, I hope I
haven't wasted too much of your time.
Peter Olsen, PE, Ae.E.
Senior Technical Staff Member
Government Solutions, ATT