numpy/scipy: correlation
robert
no-spam at no-spam-no-spam.invalid
Sun Nov 12 15:52:45 CET 2006
Robert Kern wrote:
> robert wrote:
>> Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
>> m, m-err, o, o-err, r-coef,r-coef-err ?
>
> And of course, those three parameters are not particularly meaningful together.
> If your model is truly "y is a linear response given x with normal noise" then
> "y=m*x+o" is correct, and all of the information that you can get from the data
> will be found in the estimates of m and o and the covariance matrix of the
> estimates.
>
> On the other hand, if your model is that "(x, y) is distributed as a bivariate
> normal distribution" then "y=m*x+o" is not a particularly good representation of
> the model. You should instead estimate the mean vector and covariance matrix of
> (x, y). Your correlation coefficient will be the off-diagonal term after
> dividing out the marginal standard deviations.
>
> The difference between the two models is that the first places no restrictions
> on the distribution of x. The second does; both the x and y marginal
> distributions need to be normal. Under the first model, the correlation
> coefficient has no meaning.
I think the difference is small in practice when you head for usable diagonals.
Looking at the bivariate coefficient first, before going on to any model, seems to be a more stable approach for the first step in data mining (before you proceed to a model or to class learning).
Basically, the first need is to analyse lots of x,y data and check for linear dependencies. There is no real model so far. I'd need a quality measure (coef**2) and a way to know how much I can rely on it (coef-err). coef alone is not enough: you get a perfect 1.0 with 2 (or 3, see below) points.
With big coefs and lots of well-distributed data the coef is very good by itself: its error range err(N) is only approx. ~ 1/sqrt(N).
One would expect the error range to drop simply with the number of points. Yet it depends in a more complex way on the mean value of the coef and on the distribution of the data as a whole.
More interesting real-world cases: for example, I see a low correlation on lots of points, maybe coef=0.05. Is that a real effect or not? Low coefs thus naturally require a coef-err to be useful in practice.
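One common way to attach an error range to a coefficient like this is the Fisher z-transform. This is a textbook technique swapped in here, not something taken from numpy/scipy or from this thread. The sketch below assumes roughly bivariate-normal data; the function name fisher_interval and the parameter z_crit are made up for illustration.

```python
import math

def fisher_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation coefficient.

    r      -- sample correlation coefficient
    n      -- number of (x, y) points
    z_crit -- normal quantile; 1.96 gives roughly 95% coverage

    Assumes roughly bivariate-normal data (Fisher z-transform).
    """
    if n <= 3:
        raise ValueError("the Fisher interval needs more than 3 points")
    if abs(r) >= 1.0:
        raise ValueError("the transform is undefined for |r| = 1")
    z = math.atanh(r)                        # Fisher z-transform of r
    half_width = z_crit / math.sqrt(n - 3)   # standard error is 1/sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# The coef=0.05 case: is it distinguishable from zero?
low_1k, high_1k = fisher_interval(0.05, 1000)
low_10k, high_10k = fisher_interval(0.05, 10000)
print(low_1k, high_1k)
print(low_10k, high_10k)
```

Under these assumptions the interval for coef=0.05 still includes zero at 1000 points but no longer does at 10000 points. Like the simple 1/sqrt(N) estimate, it depends only on r and N, not on how the points are spread.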
Now think of adding 'boring data':
>>> X=[1.,2,3,4]
>>> Y=[1.,2,3,5]
>>> sd.correlation((X,Y)) # my old func
(1.3, -0.5, 0.982707629824) # m,o,coef
>>> numpy.corrcoef((X,Y))
array([[ 1. , 0.98270763],
[ 0.98270763, 1. ]])
>>> XX=[1.,1,1,1,1,2,3,4]
>>> YY=[1.,1,1,1,1,2,3,5]
>>> sd.correlation((XX,YY))
(1.23684210526, -0.289473684211, 0.988433774639)
>>>
I'd expect this: the small increase of r is OK. But this 'boring data' should not make the error go down simply as ~1/sqrt(N) ...
I remember once seeing a formula for an error range of the corrcoef somewhere, but I cannot find it anymore.
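An error estimate that does react to how the points are distributed is the bootstrap: resample the (x, y) pairs with replacement and look at the spread of the resulting coefficients. This is a generic resampling technique, not the formula being looked for here. The names pearson_r and bootstrap_r_error, the number of resamples and the seed are all arbitrary choices for this sketch.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation, or None if x or y has no variation."""
    if len(set(xs)) < 2 or len(set(ys)) < 2:
        return None
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def bootstrap_r_error(xs, ys, n_resamples=2000, seed=0):
    """Standard deviation of r over bootstrap resamples of the pairs."""
    rng = random.Random(seed)
    pairs = list(zip(xs, ys))
    rs = []
    for _ in range(n_resamples):
        sample = [rng.choice(pairs) for _ in pairs]
        r = pearson_r([p[0] for p in sample], [p[1] for p in sample])
        if r is not None:          # skip degenerate resamples
            rs.append(r)
    mean_r = sum(rs) / len(rs)
    return math.sqrt(sum((r - mean_r) ** 2 for r in rs) / (len(rs) - 1))

XX = [1., 1, 1, 1, 1, 2, 3, 4]
YY = [1., 1, 1, 1, 1, 2, 3, 5]
print(bootstrap_r_error(XX, YY))
```

With very few points many resamples are degenerate (all x equal) and are skipped, so the result is only a rough indication for data sets as small as the ones above.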
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Trivia
says:
In MATLAB, corr(X) calculates Pearsons correlation coefficient along with p-value.
Does anybody know how this prob.-value is computed/motivated? Such a thing would be very helpful for numpy/scipy too.
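The usual textbook motivation for such a p-value is a t-test: if x and y are uncorrelated and normally distributed, t = r*sqrt((N-2)/(1-r**2)) follows a Student t distribution with N-2 degrees of freedom. Whether MATLAB's corr does exactly this is an assumption here, not something checked against its source. The sketch below integrates the t density numerically so that it needs only the standard library; the name pearson_p_value and the steps parameter are made up. Where scipy is available, scipy.stats.pearsonr returns r together with a p-value.

```python
import math

def pearson_p_value(r, n, steps=2000):
    """Two-sided p-value for the null hypothesis of zero correlation.

    Based on t = r*sqrt((n-2)/(1-r**2)) with n-2 degrees of freedom.
    steps must be even (Simpson's rule).
    """
    if n < 3:
        raise ValueError("need at least 3 points")
    if abs(r) >= 1.0:
        return 0.0
    dof = n - 2
    t = abs(r) * math.sqrt(dof / (1.0 - r * r))
    # Log of the normalising constant of the Student t density.
    log_norm = (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
                - 0.5 * math.log(dof * math.pi))

    def pdf(u):
        return math.exp(log_norm - (dof + 1) / 2.0 * math.log1p(u * u / dof))

    # Simpson's rule for the probability mass between 0 and t.
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3.0
    return max(0.0, 1.0 - 2.0 * area)

print(pearson_p_value(0.982707629824, 4))
```

For N=4 there are 2 degrees of freedom and the p-value reduces to 1-|r|, so even r=0.98 on 4 points gives a p-value of about 0.017 rather than something tiny. The fixed-step integration is only meant for moderate t values.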
http://links.jstor.org/sici?sici=0162-1459(192906)24%3A166%3C170%3AFFPEOC%3E2.0.CO%3B2-Y
tells:
probable error of r = 0.6745*(1-r**2)/sqrt(N)
This is a simple function of r and N, roughly what I expected above for the N-only dependence. But it is therefore not sensitive to the considerations above about 'boring' data. For the example above, this probable coef-err would decrease from
0.0115628571429 to 0.00548453410954 !
And the absolute size of this error measure seems too low for just 4 points of data!
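The two numbers can be reproduced directly from the formula. A minimal sketch follows; the function name probable_error is made up, and the r values are the ones printed in the sessions above.

```python
import math

def probable_error(r, n):
    """Probable error of r as given above: 0.6745*(1-r**2)/sqrt(N)."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

pe_4 = probable_error(0.982707629824, 4)   # X, Y example
pe_8 = probable_error(0.988433774639, 8)   # XX, YY example with 'boring data'
print(pe_4, pe_8)
```

The factor 0.6745 is the 75% quantile of the standard normal distribution, so a "probable error" is the half-width of a 50% interval, about two thirds of a standard error. That is one reason the absolute size looks small. The formula is also a large-sample approximation, so it should not be trusted for 4 points.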
The other formula which I remember seeing once was much more sophisticated and used things like sum_xxy etc...
Robert
PS:
My old func is simply hand-written, based on
n,sum_x,sum_y,sum_xy,sum_xx,sum_yy=len(vx),vx.sum(),vy.sum(),(vx*vy).sum(),(vx*vx).sum(),(vy*vy).sum()
I guess it's already fast for large data?
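Only the line with the sums is given here, so the rest of the function has to be guessed. The following is a hypothetical reconstruction using the standard least-squares formulas. It is written for plain Python sequences so that it runs without numpy; with arrays the sums become vx.sum(), (vx*vy).sum() and so on, as in the line above.

```python
import math

def correlation(vx, vy):
    """Return (m, o, r) for the least-squares line y = m*x + o.

    Hypothetical reconstruction: only the sums come from the post.
    """
    n = len(vx)
    sum_x, sum_y = sum(vx), sum(vy)
    sum_xy = sum(x * y for x, y in zip(vx, vy))
    sum_xx = sum(x * x for x in vx)
    sum_yy = sum(y * y for y in vy)
    # Scaled (co)variances in terms of the raw sums; the common
    # factor cancels in m and r.
    cov_xy = n * sum_xy - sum_x * sum_y
    var_x = n * sum_xx - sum_x * sum_x
    var_y = n * sum_yy - sum_y * sum_y
    if var_x == 0 or var_y == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    m = cov_xy / var_x                       # slope
    o = (sum_y - m * sum_x) / n              # offset
    r = cov_xy / math.sqrt(var_x * var_y)    # correlation coefficient
    return m, o, r

print(correlation([1., 2, 3, 4], [1., 2, 3, 5]))
```

A single pass over the data like this is cheap, but the differences of raw sums can lose precision when the values have a large mean compared to their spread. Subtracting the means first avoids that at the cost of a second pass.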
Note: numpy.corrcoef fails on 2 points:
>>> numpy.corrcoef(([1,2],[1,2]))
array([[ -1.#IND, -1.#IND],
[ -1.#IND, -1.#IND]])
>>> sd.correlation(([1,2],[1,2]))
(1, 0, 1.0)
>>>
>>> numpy.corrcoef(([1,2,3],[1,2,3]))
array([[ 1., 1.],
[ 1., 1.]])
>>> sd.correlation(([1,2,3],[1,2,3]))
(1, 0, 1.0)
PPS:
A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. I think many users currently suffer when trying to get started, because the most recent releases of scipy and numpy are incompatible.