[Tutor] Matching relational data

Mon Oct 4 00:37:57 CEST 2010

On Mon, 4 Oct 2010 08:33:07 am David Hutto wrote:
> I'm creating an app that charts/graphs data. The mapping of the
> graphs is the 'easy' part with matplotlib,
> and wx. My question relates to the alignment of the data to be
> processed.
>
> Let's say I have three sets of 24 hr graphs with the same time steps:
>
> -the position of the sun
> -the temp.
> -local powerplant energy consumption
>
>
> A human could perceive the relations that when it's wintertime, cold
> and the sun goes down, heaters are turned on
> and energy consumption goes up, and the opposite in summer when
> the sun comes up.
> My problem is how to compare and make the program perceive the
> relation.

This is a statistics problem, not a programming problem. Or rather, 
parts of it *use* programming to solve the statistics problem.

My statistics isn't good enough to tell you how to find correlations 
between three independent variables, but I can break the problem into a 
simpler one: find the correlation between two variables, temperature 
and energy consumption.

Without understanding how the data was generated, I'm not entirely sure 
how to set the data up, but here's one approach:

(1) Plot the relationship between:
    x = temperature
    y = power consumption

    where x is the independent variable and y is the dependent variable.

(2) Look at the graph. Can you see any obvious pattern? If all the data
    points are scattered randomly around the graph, then you can be
    fairly sure that there is no correlation, and you can go straight on
    to calculating the correlation coefficient to make sure.

(3) But if the graph clearly appears to be made of separate sections, 
    AND those sections correlate to physical differences due to the time 
    of day (position of the sun), then you need to break the data set
    into multiple data sets and work on each one individually.

    E.g. if the graph forms a straight line pointing DOWN for the hours 
    11pm to 5am, and a straight line pointing UP for the hours 5am
    to 11pm, and you can think of a physical reason why this is
    plausible, then you would be justified in separating out the data
    into two independent sets: 5am-11pm, 11pm-5am.

    If you want to have the program do this part for you, this is a VERY
    hard problem. You're essentially wanting to write an artificial
    intelligence system capable of picking out statistical correlations 
    from data. Such software does exist. It tends to cost hundreds of
    thousands of dollars, or millions. Good luck writing your own!

(4) Otherwise feel free to simplify the problem by just investigating
    the relationship between temperature and power consumption during
    (say) daylight hours.

(5) However you decide to proceed, you should now have one (or more) x-y
    graphs. The first step is to decide whether there is any correlation
    at all, so calculate the correlation coefficient, r. r will be a
    number between -1 and 1. r=1 means a perfect positive correlation;
    r=-1 means a perfect negative correlation; r=0 means no linear
    correlation at all. If there is no correlation, you can stop there.

(6) Decide whether the correlation is meaningful. I don't remember how
    to do this -- consult your statistics text books. If it's not
    meaningful, then you are done -- there's no statistically valid
    relationship between the variables.

(7) Otherwise, you want to calculate the line of best fit (or possibly
    some other curve, but let's stick to straight lines for now) for the
    data. The line of best fit may be complicated to calculate, and it
    may not be justified statistically, so start off with something
    simpler which (hopefully!) is nearly as good -- a linear regression
    line. This calculates a line that statistically matches your data.

(8) Technically, you can calculate a regression line for *any* data,
    even if it clearly doesn't form a line. That's why you are checking
    the correlation coefficient to decide whether it is sensible or not.
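
To make steps (5) to (7) a little more concrete, here is a rough sketch in
plain Python. Treat it as an illustration under assumptions, not a
definitive recipe: the function names (pearson_r, permutation_p_value,
fit_line) are made up for this example, the temperature and power figures
are invented, and the permutation test in the middle is just one simple
way of judging whether r is meaningful, swapped in here because it needs
nothing beyond the standard library. Your statistics text book may well
recommend a different test.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Fraction of random shufflings of ys whose |r| is at least as
    large as the observed |r|. A small value suggests the observed
    correlation is unlikely to be a fluke of the pairing."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented readings, for illustration only:
# temperature in degrees C, power consumption in MW.
temperature = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
power = [95.0, 90.0, 88.0, 80.0, 77.0, 70.0, 68.0, 60.0]

r = pearson_r(temperature, power)
p = permutation_p_value(temperature, power)
slope, intercept = fit_line(temperature, power)

print("r = %.3f, permutation p-value ~ %.3f" % (r, p))
print("power ~ %.2f * temperature + %.2f" % (slope, intercept))
```

If you have split the data into separate sets as in step (3), you would
run the same three functions on each set individually. Note that
pearson_r will divide by zero if either variable is constant, so real
code should check for that first.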

By now any *real* statisticians reading this will be horrified :) What 
I've described is essentially the most basic, "Stats 101 for Dummies" 
level.

Have fun!

-- 
Steven D'Aprano