[Tutor] Matching relational data
Steven D'Aprano
steve at pearwood.info
Mon Oct 4 00:37:57 CEST 2010
On Mon, 4 Oct 2010 08:33:07 am David Hutto wrote:
> I'm creating an app that charts/graphs data. The mapping of the
> graphs is the 'easy' part with matplotlib,
> and wx. My question relates to the alignment of the data to be
> processed.
>
> Let's say I have three sets of 24 hr graphs with the same time steps:
>
> -the position of the sun
> -the temp.
> -local powerplant energy consumption
>
>
> A human could perceive the relations that when it's wintertime, cold
> and the sun goes down, heaters are turned on
> and energy consumption goes up, and the opposite in summer when
> the sun comes up.
> My problem is how to compare and make the program perceive the
> relation.
This is a statistics problem, not a programming problem. Or rather,
part of it *uses* programming to solve the statistics problem.
My statistics isn't good enough to tell you how to find correlations
between three independent variables, but I can break the problem into a
simpler one: find the correlation between two variables, temperature
and energy consumption.
Without understanding how the data was generated, I'm not entirely sure
how to set the data up, but here's one approach:
(1) Plot the relationship between:
x = temperature
y = power consumption
where x is the independent variable and y is the dependent variable.
(2) Look at the graph. Can you see any obvious pattern? If all the data
points are scattered randomly around the graph, then you can be
fairly sure that there is no correlation and you can go straight on
to calculating the correlation coefficient to make sure.
(3) But if the graph clearly appears to be made of separate sections,
AND those sections correlate to physical differences due to the time
of day (position of the sun), then you need to break the data set
into multiple data sets and work on each one individually.
E.g. if the graph forms a straight line pointing DOWN for the hours
11pm to 5am, and a straight line pointing UP for the hours 5am
to 11pm, and you can think of a physical reason why this is
plausible, then you would be justified in separating out the data
into two independent sets: 5am-11pm, 11pm-5am.
If you want to have the program do this part for you, this is a VERY
hard problem. You're essentially wanting to write an artificial
intelligence system capable of picking out statistical correlations
from data. Such software does exist. It tends to cost hundreds of
thousands of dollars, or millions. Good luck writing your own!
(4) Otherwise feel free to simplify the problem by just investigating
the relationship between temperature and power consumption during
(say) daylight hours.
(5) However you decide to proceed, you should now have one (or more) x-y
graphs. The first step is to decide whether there is any correlation
at all, so calculate the correlation coefficient, r. r will be a
number between -1 and 1. r=1 means a perfect positive correlation;
r=-1 means a perfect negative correlation. r=0 means no (linear)
correlation at all. If there is no correlation, you can stop there.
(6) Decide whether the correlation is meaningful. I don't remember how
to do this -- consult your statistics textbooks. If it's not
meaningful, then you are done -- there's no statistically valid
relationship between the variables.
(7) Otherwise, you want to calculate the line of best fit (or possibly
some other curve, but let's stick to straight lines for now) for the
data. The line of best fit may be complicated to calculate, and it
may not be justified statistically, so start off with something
simpler which (hopefully!) is nearly as good -- a linear regression
line. This calculates a line that statistically matches your data.
(8) Technically, you can calculate a regression line for *any* data,
even if it clearly doesn't form a line. That's why you are checking
the correlation coefficient to decide whether it is sensible or not.
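Steps (5) and (7) above can be sketched in plain Python along these
lines. This is only a rough illustration, not a vetted statistics
routine: the function names pearson_r and regression_line are my own
invention, the readings are made-up numbers rather than real data, and
it assumes both lists have the same length and that neither one is
constant.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Correlation coefficient r between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def regression_line(xs, ys):
    """Least-squares (slope, intercept) for the line y = slope*x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


if __name__ == "__main__":
    # Made-up illustrative readings, NOT real measurements:
    # x = temperature (degrees C), y = power consumption (arbitrary units).
    temperature = [1.0, 3.0, 4.0, 7.0, 9.0, 12.0]
    consumption = [95.0, 88.0, 80.0, 71.0, 59.0, 50.0]

    r = pearson_r(temperature, consumption)
    slope, intercept = regression_line(temperature, consumption)
    print("r =", r)
    print("y = %.2f * x + %.2f" % (slope, intercept))
```

If you split the data by time of day as in step (3), you would call
these two functions once per subset and compare the results.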
By now any *real* statisticians reading this will be horrified :) What
I've described is essentially the most basic, "Stats 101 for Dummies"
level.
Have fun!
--
Steven D'Aprano