[Tutor] Matching relational data
smokefloat at gmail.com
Mon Oct 4 04:20:35 CEST 2010
On Sun, Oct 3, 2010 at 6:37 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Mon, 4 Oct 2010 08:33:07 am David Hutto wrote:
>> I'm creating an app that charts/graphs data. The mapping of the
>> graphs is the 'easy' part with matplotlib,
>> and wx. My question relates to the alignment of the data to be
>> Let's say I have three sets of 24 hr graphs with the same time steps:
>> -the position of the sun
>> -the temp.
>> -local powerplant energy consumption
>> A human could perceive the relations that when it's wintertime, cold
>> and the sun goes down, heaters are turned on
>> and energy consumption goes up, and the opposite in summer when it
>> the sun comes up.
>> My problem is how to compare and make the program perceive the
> This is a statistics problem, not a programming problem. Or rather,
> part of it *uses* programming to solve the statistics problem.
> My statistics isn't good enough to tell you how to find correlations
> between three independent variables, but I can break the problem into a
> simpler one: find the correlation between two variables, temperature
> and energy consumption.
This was my initial starting point, but I thought that comparing
multiple variables should set the tone
for how the data is interpreted. You're right, though: I should start with
two, and then relate one pairwise relation to another.
So if x and y are compared and related, then it makes sense that if x
and b are compared and related, b and y are related in some way,
because they have x in common in terms of the two-object comparison
(see below for the comparative, percentage-based statistical analysis algorithm):
x and y = related
x and b = related
so y and b = related
But that leaves me with a paradox about where the comparison should
start and end. Do I compare the end object to all of the others, or do
random pairwise list comparisons, match the data over
corresponding time steps, and then narrow down the list of comparable pairs based
on a hierarchy of matches? In other words:
if x relates to y and x relates to b:
So I have (really rough pseudocode, due to time constraints):
list1 = ['+', '+', '+', '+', '-', '+', '-', '+', '-', '+']
list2 = ['-', '+', '+', '+', '-', '+', '-', '+', '-', '+']
Above I have a 90% match in time-step increments/decrements (9 of the
10 steps agree). That means that over, say, one-minute
periods, x and y both increased or decreased together 90% of the time.
In the opposite case, they would have diverged 90%
of the time.
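A minimal Python sketch of that percentage match, assuming both series are already sampled on the same time steps. The helper names (step_signs, match_fraction) and the sample readings are made up for illustration; they do not come from any real data set.

```python
def step_signs(values):
    """Return '+', '-' or '0' for each step between consecutive readings."""
    signs = []
    for previous, current in zip(values, values[1:]):
        if current > previous:
            signs.append('+')
        elif current < previous:
            signs.append('-')
        else:
            signs.append('0')
    return signs


def match_fraction(signs_a, signs_b):
    """Fraction of time steps on which both series moved the same way."""
    if len(signs_a) != len(signs_b) or not signs_a:
        raise ValueError("need two non-empty sign lists of equal length")
    matches = sum(1 for a, b in zip(signs_a, signs_b) if a == b)
    return float(matches) / len(signs_a)


# Hypothetical readings, chosen so the steps reproduce list1 and list2 above.
x = [0, 1, 2, 3, 4, 3, 4, 3, 4, 3, 4]
y = [5, 4, 5, 6, 7, 6, 7, 6, 7, 6, 7]

list1 = step_signs(x)
list2 = step_signs(y)
print(match_fraction(list1, list2))  # 9 of the 10 steps agree
```

A high match only says that the two series tend to move in the same direction; it ignores how large each move is, so it is a cruder measure than a correlation coefficient.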
> Without understanding how the data was generated, I'm not entirely sure
> how to set the data up, but here's one approach:
> (1) Plot the relationship between:
> x = temperature
> y = power consumption
> where x is the independent variable and y is the dependent variable.
> (2) Look at the graph. Can you see any obvious pattern? If all the data
> points are scattered randomly around the graph, then you can be
> fairly sure that there is no correlation and you can go straight on
> to calculating the correlation coefficient to make sure.
> (3) But if the graph clearly appears to be made of separate sections,
> AND those sections correlate to physical differences due to the time
> of day (position of the sun), then you need to break the data set
> into multiple data sets and work on each one individually.
> E.g. if the graph forms a straight line pointing DOWN for the hours
> 11pm to 5am, and a straight line pointing UP for the hours 5am
> to 11pm, and you can think of a physical reason why this is
> plausible, then you would be justified in separating out the data
> into two independent sets: 5am-11pm, 11pm-5am.
> If you want to have the program do this part for you, this is a VERY
> hard problem. You're essentially wanting to write an artificial
> intelligence system capable of picking out statistical correlations
> from data. Such software does exist. It tends to cost hundreds of
> thousands of dollars, or millions. Good luck writing your own!
> (4) Otherwise feel free to simplify the problem by just investigating
> the relationship between temperature and power consumption during
> (say) daylight hours.
> (5) However you decide to proceed, you should now have one (or more) x-y
> graph. First step is to decide whether there is any correlation at
> all. If there is not, you can stop there. Calculate the correlation
> coefficient, r. r will be a number between -1 and 1. r=1 means a
> perfect positive correlation; r=-1 means a perfect negative
> correlation. r=0 means no linear correlation at all.
> (6) Decide whether the correlation is meaningful. I don't remember how
> to do this -- consult your statistics text books. If it's not
> meaningful, then you are done -- there's no statistically valid
> relationship between the variables.
> (7) Otherwise, you want to calculate the line of best fit (or possibly
> some other curve, but let's stick to straight lines for now) for the
> data. The line of best fit may be complicated to calculate, and it
> may not be justified statistically, so start off with something
> simpler which (hopefully!) is nearly as good -- a linear regression
> line. This calculates a line that statistically matches your data.
> (8) Technically, you can calculate a regression line for *any* data,
> even if it clearly doesn't form a line. That's why you are checking
> the correlation coefficient to decide whether it is sensible or not.
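Steps (5), (7) and (8) can be sketched in plain Python roughly as follows. This is only a sketch under assumptions: the function names and the temperature/power figures are invented for illustration, and it uses the textbook Pearson and ordinary least-squares formulas rather than any particular library.

```python
import math


def _sums(xs, ys):
    """Return the means and the centred sums of squares and products."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy


def pearson_r(xs, ys):
    """Pearson correlation coefficient, a number between -1 and 1."""
    _, _, sxx, syy, sxy = _sums(xs, ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one series is constant")
    return sxy / math.sqrt(sxx * syy)


def regression_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y, sxx, _, sxy = _sums(xs, ys)
    if sxx == 0:
        raise ValueError("cannot fit a line when every x is the same")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up figures: temperature (x) and power consumption (y).
temperature = [0, 5, 10, 15, 20]
power = [100, 90, 80, 70, 60]

r = pearson_r(temperature, power)
slope, intercept = regression_line(temperature, power)
print(r, slope, intercept)
```

As step (8) says, regression_line will return a slope and intercept for any data, so it makes sense to look at r before trusting the line.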
> By now any *real* statisticians reading this will be horrified :) What
> I've described is essentially the most basic, "Stats 101 for Dummies"
> Have fun!
> Steven D'Aprano