[Tutor] Matching relational data

Mon Oct 4 04:20:35 CEST 2010

On Sun, Oct 3, 2010 at 6:37 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Mon, 4 Oct 2010 08:33:07 am David Hutto wrote:
>> I'm creating an app that charts/graphs data. The mapping of the
>> graphs is the 'easy' part with matplotlib,
>> and wx. My question relates to the alignment of the data to be
>> processed.
>>
>> Let's say I have three sets of 24 hr graphs with the same time steps:
>>
>> -the position of the sun
>> -the temp.
>> -local powerplant energy consumption
>>
>>
>> A human could perceive the relations that when it's wintertime, cold
>> and the sun goes down, heaters are turned on
>> and energy consumption goes up, and the opposite in summer when it
>> the sun comes up.
>> My problem is how to compare and make the program perceive the
>> relation.
>
> This is a statistics problem, not a programming problem. Or rather,
> parts of it *use* programming to solve the statistics problem.
>
> My statistics isn't good enough to tell you how to find correlations
> between three independent variables, but I can break the problem into a
> simpler one: find the correlation between two variables, temperature
> and energy consumption.

This was my initial starting point, but I thought that comparing
multiple series at once should set the tone for how the data is
interpreted. You're right, though: I should start with two, and then
relate one relationship to another within that two-object comparison
structure.

So if x and y are compared and related, then it seems to make sense
that if x and b are also compared and related, b and y may be related
in some way, because they have x in common in terms of two-object
comparisons (though that is not guaranteed).

or:
(see below for the comparative, percentage-based analysis)
x and y = related
x and b = related
so, possibly: y and b = related

But that raises the question of where the comparisons should start and
end in order to get what I ultimately want from the program. Do I
compare the final object to all of the others, or do I run random
two-list comparisons, match the data over corresponding timesteps, and
then narrow down the list of comparable pairs based on a hierarchy of
matches?
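
One way to sidestep that question is to compare every pair directly and
rank the results, rather than inferring one relation from two others.
A rough sketch, assuming each series has already been reduced to a
'+'/'-' direction per timestep; the names agreement and rank_pairs and
the data are made up for illustration:

```python
from itertools import combinations


def agreement(a, b):
    """Fraction of timesteps where two equal-length direction lists match."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and cover the same timesteps")
    return sum(1 for p, q in zip(a, b) if p == q) / len(a)


def rank_pairs(series):
    """Return (name1, name2, score) for every pair, best match first."""
    scores = [
        (n1, n2, agreement(series[n1], series[n2]))
        for n1, n2 in combinations(sorted(series), 2)
    ]
    return sorted(scores, key=lambda item: item[2], reverse=True)


# Invented direction data: '+' = rose during the step, '-' = fell.
series = {
    "x": list("++--++--"),
    "y": list("++--+---"),
    "b": list("+-+-++-+"),
}

for name1, name2, score in rank_pairs(series):
    print(name1, name2, score)
```

With this invented data x matches y on 7 of 8 steps and b on 5 of 8,
yet y and b agree on only half of the steps, no better than a coin
toss. So "x relates to y" plus "x relates to b" does not by itself say
much about y and b; each pair has to be checked.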

If x relates to y and x relates to b, then I have (really rough
pseudocode, given time constraints):

list1 = ['+', '+', '+', '+', '-', '+', '-', '+', '-', '+']
list2 = ['-', '+', '+', '+', '-', '+', '-', '+', '-', '+']

Above I have a 90% match of timestep increments/decrements, meaning
that over, say, one-minute periods, x and y both increased or decreased
together 90% of the time. The opposite case would be that they diverged
90% of the time.
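
The percentage above can be computed along these lines. This is only a
sketch: directions and match_percent are hypothetical helper names, and
treating an unchanged reading as '0' is my own assumption.

```python
def directions(readings):
    """Turn raw readings into per-step directions: '+', '-' or '0'."""
    steps = []
    for prev, cur in zip(readings, readings[1:]):
        if cur > prev:
            steps.append('+')
        elif cur < prev:
            steps.append('-')
        else:
            steps.append('0')  # assumption: no change is its own category
    return steps


def match_percent(a, b):
    """Percentage of timesteps on which two direction lists agree."""
    if len(a) != len(b) or not a:
        raise ValueError("lists must be non-empty and the same length")
    hits = sum(1 for p, q in zip(a, b) if p == q)
    return 100.0 * hits / len(a)


# The two lists from the pseudocode above.
list1 = list("++++-+-+-+")
list2 = list("-+++-+-+-+")
print(match_percent(list1, list2))   # 90.0

# Raw readings would first go through directions():
print(directions([20, 21, 21, 19]))  # ['+', '0', '-']
```

A result near 100 means the two series move together, and a result
near 0 means they mostly move in opposite directions, which is the
"diverged" case.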

>
> Without understanding how the data was generated, I'm not entirely sure
> how to set the data up, but here's one approach:
>
> (1) Plot the relationship between:
>    x = temperature
>    y = power consumption
>
>    where x is the independent variable and y is the dependent variable.
>
> (2) Look at the graph. Can you see any obvious pattern? If all the data
>    points are scattered randomly around the graph, then you can be
>    fairly sure that there is no correlation and you can go straight on
>    to calculating the correlation coefficient to make sure.
>
> (3) But if the graph clearly appears to be made of separate sections,
>    AND those sections correlate to physical differences due to the time
>    of day (position of the sun), then you need to break the data set
>    into multiple data sets and work on each one individually.
>
>    E.g. if the graph forms a straight line pointing DOWN for the hours
>    11pm to 5am, and a straight line pointing UP for the hours 5am
>    to 11pm, and you can think of a physical reason why this is
>    plausible, then you would be justified in separating out the data
>    into two independent sets: 5am-11pm, 11pm-5am.
>
>    If you want to have the program do this part for you, this is a VERY
>    hard problem. You're essentially wanting to write an artificial
>    intelligence system capable of picking out statistical correlations
>    from data. Such software does exist. It tends to cost hundreds of
>    thousands of dollars, or millions. Good luck writing your own!
>
> (4) Otherwise feel free to simplify the problem by just investigating
>    the relationship between temperature and power consumption during
>    (say) daylight hours.
>
> (5) However you decide to proceed, you should now have one (or more) x-y
>    graphs. The first step is to decide whether there is any correlation at
>    all. If there is not, you can stop there. Calculate the correlation
>    coefficient, r. r will be a number between -1 and 1. r=1 means a
>    perfect positive correlation; r=-1 means a perfect negative
>    correlation. r=0 means no linear correlation at all.
>
> (6) Decide whether the correlation is meaningful. I don't remember how
>    to do this -- consult your statistics text books. If it's not
>    meaningful, then you are done -- there's no statistically valid
>    relationship between the variables.
>
> (7) Otherwise, you want to calculate the line of best fit (or possibly
>    some other curve, but let's stick to straight lines for now) for the
>    data. The line of best fit may be complicated to calculate, and it
>    may not be justified statistically, so start off with something
>    simpler which (hopefully!) is nearly as good -- a linear regression
>    line. This calculates a line that statistically matches your data.
>
> (8) Technically, you can calculate a regression line for *any* data,
>    even if it clearly doesn't form a line. That's why you are checking
>    the correlation coefficient to decide whether it is sensible or not.
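
For steps (5) and (6), a minimal sketch of the Pearson correlation
coefficient in plain Python. The function names and the readings are
invented for illustration. For the "is it meaningful" question, one
common textbook approach (assuming roughly normal data) is to turn r
into a t statistic with n-2 degrees of freedom and compare it against a
t table; t_statistic below only computes that number, it does not look
up the critical value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, 2+ points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one series is constant")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r**2)), with n - 2 degrees of freedom."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


# Invented readings, for illustration only.
temperature = [-5, 0, 5, 10, 15, 20]
power = [98, 90, 79, 72, 60, 55]

r = pearson_r(temperature, power)
print(r, t_statistic(r, len(temperature)))
```

Here r comes out strongly negative: in this made-up data, power use
drops as the temperature rises.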
>
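
And for step (7), a sketch of a straight-line fit by ordinary least
squares, which I am assuming is the kind of regression line meant here.
linear_fit is a made-up name and the readings are invented.

```python
def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, 2+ points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented readings, for illustration only.
temperature = [-5, 0, 5, 10, 15, 20]
power = [98, 90, 79, 72, 60, 55]

slope, intercept = linear_fit(temperature, power)
print("power = %.2f * temperature + %.2f" % (slope, intercept))
```

As step (8) says, this will happily return a line for any data, so the
result is only worth reading once the correlation check has passed.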
>
> By now any *real* statisticians reading this will be horrified :) What
> I've described is essentially the most basic, "Stats 101 for Dummies"
> level.
>
> Have fun!
>
>
>
> --
> Steven D'Aprano