numpy/scipy: error of correlation coefficient (clumpy data)

robert no-spam at no-spam-no-spam.invalid
Thu Nov 16 15:47:45 EST 2006


sturlamolden wrote:
> robert wrote:
> 
>> Think of such an example: a drunken (x,y) 2D walker is supposed to walk along a diagonal, but he makes frequent and unpredictable pauses / slow motion. You get x,y coordinates at 1 per second. His speed and timing pattern do not matter at all - you just want to know how well he keeps to his track.
> 
> 
> In which case you have time series data, i.e. regular samples from p(t)
> = [ x(t), y(t) ]. Time series have some sort of autocorrelation in the
> samples as well, which must be taken into account. Even though you could
> weight each point by the drunkard's speed, a correlation or linear
> regression would still not make any sense here, as such analyses are
> based on the assumption of no autocorrelation in the samples or the
> residuals. Correlation has no meaning if y[t] is correlated with
> y[t+1], and regression has no meaning if the residual e[t] is
> correlated with the residual e[t+1].
> 
> A state-space model could, e.g., be applicable. You could estimate the
> path of the drunkard using a Kalman filter to compute a Taylor series
> expansion p(t) = p0 + v*t + 0.5*a*t**2 + ... for the path at each step
> p(t). When you have estimates for the state parameters p0, v, and a, you
> can compute some sort of measure for the drunkard's deviation from his
> ideal path.
> 
> However, if you don't have time series data, you should not treat your
> data as such.
> 
> If you don't know how your data is generated, there is no way to deal
> with them correctly. If the samples are time series, they must be
> treated as such; if they are not, they should not be. If the samples
> are i.i.d., each point counts equally; if they are not, they do not.
> If you have clumped data due to time-series structure or lack of
> i.i.d. sampling, you must deal with that. However, data can be i.i.d.
> and clumped, if the
> underlying distribution is clumped. In order to determine the cause,
> you must consider how your data are generated and how your data are
> sampled. You need meta-information about your data to determine this.
> Matlab or Octave will not help you with this, and it is certainly not a
> weakness of NumPy as you implied in your original post. There is no way
> to put magic into any numerical computation. Statistics always require
> formulation of specific assumptions about the data. If you cannot think
> clearly about your data, then that is the problem you must solve.

Yes, in the example of the drunkard time series it's possible to go to a better model - yet even there it is very expensive (in relation to the gain in statistical quality) to worry too much about the best model for such a guy :-).
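
Still, your autocorrelation point is easy to see even without a fancy model. A quick sketch, just to make it concrete for myself - the walker data below is made up, nothing from a real application:

import numpy as np

# made-up data: progress along the diagonal comes in bursts (pauses),
# and the sideways wander is itself a random walk
rng = np.random.RandomState(0)
n = 1000
bursts = rng.exponential(1.0, n) * (rng.random_sample(n) < 0.2)
t = np.cumsum(bursts)                        # progress along the diagonal
wander = np.cumsum(rng.normal(0.0, 0.3, n))  # slowly drifting deviation
x = t - wander
y = t + wander

# ordinary least-squares line y = a*x + b and its residuals
a, b = np.polyfit(x, y, 1)
resid = y - (a * x + b)

# lag-1 autocorrelation of the residuals: the textbook error formulas
# want this near 0, the wandering walker pushes it towards 1
r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print("lag-1 autocorrelation of residuals: %.3f" % r1)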

In the field of data mining, with many data tracks, one typically digs first for multiple but smaller correlations - without a practical bottom-up model. I think one then regularly falls back to a certain basic case, maybe the most basic model for data at all: that of a "hunter" who mostly waits and only acts when rare goodies are in front of him.
Again, in the most basic case, when you have 2D x,y data in front of you without a reliable time path or the like, you see this: a density distribution of points. There is possibly a linear correlation on the largest scale - which is what you are interested in - but the points also show inhomogeneity/clumping, and this raises the question of its influence on r_err. What now? One sees clearly that it's nonsense to do just plain averaging statistics.
I think this case is a most basic default for data - even compared to the common textbook i.i.d. case. In fact, one can regard this kind of statistics, which respects the (inhomogeneous) data density itself, as a (kind of simple/independent) auto-Bayesian statistics versus dumb averaging.
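
To make clearer what I mean by such a density weighter - only a crude sketch of my own idea, the neighbour radius is an arbitrary made-up tuning parameter, and none of this is an existing numpy/scipy routine:

import numpy as np

def density_weighted_r(x, y, radius):
    """Pearson r with each point down-weighted by the number of
    neighbours within 'radius' (a crude local-density estimate), so
    that a clump of points counts roughly like a single point."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    pts = np.column_stack((x, y))
    # all pairwise distances - O(n^2), simple but fine for a sketch
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))
    counts = (d < radius).sum(axis=1)   # includes the point itself
    w = 1.0 / counts                    # clumped points share their weight
    w = w / w.sum()
    mx, my = (w * x).sum(), (w * y).sum()
    cov = (w * (x - mx) * (y - my)).sum()
    vx = (w * (x - mx) ** 2).sum()
    vy = (w * (y - my) ** 2).sum()
    return cov / np.sqrt(vx * vy)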

I think one can almost always use such a "Bayesian density weighter/filter" as the better option compared to mere averaging statistics in that case of x,y correlation - when there is obviously an interesting correlation, but you are too lazy to, or in principle unable to, work out a model at the physics level. The latter is in fact what any averaging statistics cries out for at any price - but how often can you do that in real-world applications ...

(In reality there is anyway no way to eliminate autocorrelation in the composition of data. Everything and everybody lies :-) )

That's where a top-down (model-free) Bayesian stats approach will pay off: in the earlier extreme example of criminal data duplication - I'm sure - it will totally neutralize the attack, no question. In the drunkard time-series example it will tell me very reliably how well this guy keeps his track - without the need for a complex model. In the case of a well-behaved i.i.d. data distribution it will tell me the same as simple stats. Just good news ...

Thus I can perhaps put it this way now: I have the problem of estimating the linear correlation (coefficient with error) on x,y data against the (most general assumption) "Bayesian" background of an inhomogeneous data distribution.
Therefore I'm seeking a (fast/efficient/approximate) formula for r/r_err. I guess the formula for r does not change (much) compared to that for simple averaging statistics, but the formula for r_err will.
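
For lack of such a formula, what I currently imagine is only this - a bootstrap spread as a stand-in for r_err, reusing the density_weighted_r sketch from above (n_boot and the radius are again arbitrary made-up parameters, and it is slow, not the fast/efficient thing I'm asking for):

import numpy as np

def bootstrap_r_err(x, y, radius, n_boot=200, seed=0):
    """Spread of the density-weighted r under resampling with
    replacement - a stand-in for a closed-form r_err."""
    rng = np.random.RandomState(seed)
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    n = len(x)
    rs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.randint(0, n, n)   # resample indices with replacement
        rs[i] = density_weighted_r(x[idx], y[idx], radius)
    return rs.std()

# usage on the clumped walker data from the first sketch; because
# clumped/duplicated points get down-weighted, copying a clump should
# barely move r or r_err
r = density_weighted_r(x, y, radius=1.0)
r_err = bootstrap_r_err(x, y, radius=1.0)
print("r = %.3f +- %.3f" % (r, r_err))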

Maybe it's easy with some existing means of numpy/scipy already. Maybe not. I'm far from finding the (efficient) math myself, but I know what I want - and I can see whether a formula really does it.


Robert


