numpy/scipy: error of correlation coefficient (clumpy data)

robert no-spam at no-spam-no-spam.invalid
Wed Nov 15 08:56:18 EST 2006


sturlamolden wrote:
> robert wrote:
> 
>>> t = r * sqrt( (n-2)/(1-r**2) )
> 
>> yet I'm too lazy/practical to dig these things out from there. You obviously got it - out of that, what would be a final estimate for an error range of r (for large n)?
>> That same "const. * (1-r**2)/sqrt(n)" which I found in that other document?
> 
> I gave you the formula. Solve for r and you get the confidence interval.
> You will need to use the inverse cumulative Student t distribution.
> 
> Another quick-and-dirty solution is to use bootstrapping.
> 
> from numpy import mean, std, sum, sqrt, sort
> from numpy.random import randint
> 
> def bootstrap_correlation(x,y):
>     idx = randint(len(x),size=(1000,len(x)))
>     bx = x[idx] # resamples x with replacement
>     by = y[idx] # resamples y with replacement
>     mx = mean(bx,1)
>     my = mean(by,1)
>     sx = std(bx,1)
>     sy = std(by,1)
>     # numpy's std() is the population (1/n) estimate, so the matching
>     # denominator is len(x) rather than len(x)-1
>     r = sort(sum( (bx - mx.repeat(len(x),0).reshape(bx.shape)) *
>                   (by - my.repeat(len(y),0).reshape(by.shape)), 1) /
>              (len(x)*sx*sy))
>     # bootstrap confidence interval (NB! biased)
>     return (r[25],r[975])
> 
> 
>> My main concern is, how to respect the fact, that the (x,y) points may not distribute well along the regression line.
> 
> The bootstrap is "non-parametric" in the sense that it is distribution
> free.
 

thanks for the bootstrap tester. It mainly confirms the "r_stderr = (1-r**2)/sqrt(n)" formula. The asymmetry of r (range -1..+1) is less of a problem.
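
For reference, the t-inversion you describe next to the approximate formula
(a minimal sketch, assuming scipy.stats is available; the function names are
mine):

from math import sqrt
from scipy.stats import t as student_t, norm

def r_critical(n, alpha=0.05):
    # invert t = r*sqrt((n-2)/(1-r**2)) at the t quantile: this gives
    # the smallest |r| significant at level alpha (the rho=0 cutoff)
    tc = student_t.ppf(1.0 - alpha/2.0, n - 2)
    return tc / sqrt(n - 2 + tc**2)

def r_confint_approx(r, n, alpha=0.05):
    # large-n normal approximation: r +/- z * (1-r**2)/sqrt(n)
    z = norm.ppf(1.0 - alpha/2.0)
    se = (1.0 - r**2) / sqrt(n)
    return (r - z*se, r + z*se)
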
Yet my main problem - how to account for a clumpy distribution of the data points - is still the same.
In practice, think of a situation where the data from an experiment has an unknown damping/filter (or whatever unknown data clumper) acting on it, and thus a lot of redundancy in effect.
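
(correlation() used below is the helper from earlier in this thread; a
stand-in consistent with the printed numbers - returning (r, r_stderr,
slope, intercept) - would be:)

from numpy import asarray, mean, sqrt

def correlation(x, y):
    # hypothetical stand-in, reconstructed to match the outputs below:
    # returns (r, approx. stderr of r, regression slope, intercept)
    x, y = asarray(x, float), asarray(y, float)
    n = len(x)
    mx, my = mean(x), mean(y)
    cxy = mean(x*y) - mx*my            # population covariance
    vx = mean(x*x) - mx*mx
    vy = mean(y*y) - my*my
    r = cxy / sqrt(vx*vy)
    r_stderr = (1.0 - r**2) / sqrt(n)  # the large-n formula above
    slope = cxy / vx
    intercept = my - slope*mx
    return (r, r_stderr, slope, intercept)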
An extreme example is to just duplicate data:

>>> x ,y =[0.,0,0,0,1]*10 ,[0.,1,1,1,1]*10
>>> xx,yy=[0.,0,0,0,1]*100,[0.,1,1,1,1]*100
>>> correlation(x,y)
(0.25, 0.132582521472, 0.25, 0.75)
>>> correlation(xx,yy)
(0.25, 0.0419262745781, 0.25, 0.75)
>>> bootstrap_correlation(array(x),array(y))
(0.148447544378, 0.375391432338)
>>> bootstrap_correlation(array(xx),array(yy))
(0.215668822617, 0.285633303438)
>>> 

Here the bootstrap test also tells us that the confidence interval narrows by a factor of ~sqrt(10) (its width shrinks from about 0.23 to 0.07) - just as if there were 10-fold more well-distributed "new" data. Thus this kind of error estimation has no reasonable basis for data which is not well distributed.

The interesting task is probably this: check for linear correlation, but somehow weight the clumping of the data in the error estimation.
So far I can only think of some kind of geometric density approach...
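
One known device in that direction (for dependent samples generally, not
necessarily the final answer here) is a moving-block bootstrap: resample
contiguous blocks instead of single points, so that clumps stay together.
A rough sketch - blocklen is an ad-hoc parameter that would have to match
the clump size:

from numpy import arange, asarray, corrcoef, sort, zeros
from numpy.random import randint

def block_bootstrap_correlation(x, y, blocklen=5, nboot=1000):
    # resample contiguous blocks so that clumped/dependent points
    # travel together instead of being drawn independently
    x, y = asarray(x, float), asarray(y, float)
    n = len(x)
    nblocks = n // blocklen
    rs = zeros(nboot)
    for i in range(nboot):
        starts = randint(n - blocklen + 1, size=nblocks)
        idx = (starts[:,None] + arange(blocklen)).ravel()
        rs[i] = corrcoef(x[idx], y[idx])[0,1]
    rs = sort(rs)
    return (rs[int(0.025*nboot)], rs[int(0.975*nboot)])

How much this helps depends entirely on blocklen; for the duplicated data
above the redundancy is spread over the whole series, so some density- or
effective-sample-size-based weighting may still be needed.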

Or is there a commonly known, straightforward approach/formula for this problem?
In that formula which I only weakly remember, I think there were additional basic sum terms like sum_xxy, sum_xyy, ... in it (terms which are not needed for the formula for r itself).


Robert


