corrcoef of masked array
![](https://secure.gravatar.com/avatar/24fbc6670ed0eb41daf22091f84877d9.jpg?s=120&d=mm&r=g)
Hi numpy users, I have a masked array of dimension (nvariables, nobservations) that contain missing values at arbitrary points. Is it safe to rely on numpy.corrcoeff to calculate the correlation coefficients of a masked array (it seems to give reasonable results)? Cheers, Jesper
![](https://secure.gravatar.com/avatar/764323a14e554c97ab74177e0bce51d4.jpg?s=120&d=mm&r=g)
Jesper Larsen wrote:
No, it isn't. There are several different options for estimating correlations in the face of missing data, none of which are clearly superior to the others. None of them are trivial. None of them are implemented in numpy. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
![](https://secure.gravatar.com/avatar/24fbc6670ed0eb41daf22091f84877d9.jpg?s=120&d=mm&r=g)
On Friday 25 May 2007 19:18, Robert Kern wrote:
Thanks, my previous post was sent a bit too early since it became clear to me by reading the code for corrcoef that it is not safe for use with masked arrays. Here is my solution for calculating the correlation coefficients for masked arrays. Comments are appreciated: def macorrcoef(data1, data2): """ Calculates correlation coefficients taking masked out values into account. It is assumed (but not checked) that data1.shape == data2.shape. """ nv, no = data1.shape cc = ma.array(0., mask=ones((nv, nv))) if no > 1: for i in range(nv): for j in range(nv): m = ma.getmaskarray(data1[i,:]) | ma.getmaskarray(data2[j,:]) d1 = ma.array(data1[i,:], copy=False, mask=m).compressed() d2 = ma.array(data2[j,:], copy=False, mask=m).compressed() if ma.count(d1) > 1: c = corrcoef(d1, d2) cc[i,j] = c[0,1] return cc - Jesper
![](https://secure.gravatar.com/avatar/764323a14e554c97ab74177e0bce51d4.jpg?s=120&d=mm&r=g)
Jesper Larsen wrote:
I'm afraid this doesn't work, either. Correlation matrices are constrained to be positive semidefinite; that is, all of their eigenvalues must be >= 0. Calculating each of the correlation coefficients in a pairwise fashion doesn't incorporate this constraint. But you're on the right track. My preferred approach to this problem is to find the pairwise correlation matrix as you did and then find the closest positive semidefinite matrix to it using the method of alternating projections. I can't give you the code I wrote for this since it belongs to a customer, but here is the reference I used: http://eprints.ma.man.ac.uk/232/ -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
![](https://secure.gravatar.com/avatar/af6c39d6943bd4b0e1fde23161e7bb8c.jpg?s=120&d=mm&r=g)
Hi Jesper On Fri, May 25, 2007 at 10:37:44AM +0200, Jesper Larsen wrote:
I don't think it is. If my thinking is correct, you would expect the following to have different results: In [38]: x = N.random.random(100) In [39]: y = N.random.random(100) In [40]: N.corrcoef(x,y) Out[40]: array([[ 1. , -0.07291798], [-0.07291798, 1. ]]) In [41]: x_ = N.ma.masked_array(x,mask=(N.random.random(100)>0.5).astype(bool)) In [42]: y_ = N.ma.masked_array(y,mask=(N.random.random(100)>0.5).astype(bool)) In [43]: N.corrcoef(x_,y_) Out[43]: array([[ 1. , -0.07291798], [-0.07291798, 1. ]]) Regards Stéfan
![](https://secure.gravatar.com/avatar/764323a14e554c97ab74177e0bce51d4.jpg?s=120&d=mm&r=g)
Jesper Larsen wrote:
No, it isn't. There are several different options for estimating correlations in the face of missing data, none of which are clearly superior to the others. None of them are trivial. None of them are implemented in numpy. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
![](https://secure.gravatar.com/avatar/24fbc6670ed0eb41daf22091f84877d9.jpg?s=120&d=mm&r=g)
On Friday 25 May 2007 19:18, Robert Kern wrote:
Thanks, my previous post was sent a bit too early since it became clear to me by reading the code for corrcoef that it is not safe for use with masked arrays. Here is my solution for calculating the correlation coefficients for masked arrays. Comments are appreciated: def macorrcoef(data1, data2): """ Calculates correlation coefficients taking masked out values into account. It is assumed (but not checked) that data1.shape == data2.shape. """ nv, no = data1.shape cc = ma.array(0., mask=ones((nv, nv))) if no > 1: for i in range(nv): for j in range(nv): m = ma.getmaskarray(data1[i,:]) | ma.getmaskarray(data2[j,:]) d1 = ma.array(data1[i,:], copy=False, mask=m).compressed() d2 = ma.array(data2[j,:], copy=False, mask=m).compressed() if ma.count(d1) > 1: c = corrcoef(d1, d2) cc[i,j] = c[0,1] return cc - Jesper
![](https://secure.gravatar.com/avatar/764323a14e554c97ab74177e0bce51d4.jpg?s=120&d=mm&r=g)
Jesper Larsen wrote:
I'm afraid this doesn't work, either. Correlation matrices are constrained to be positive semidefinite; that is, all of their eigenvalues must be >= 0. Calculating each of the correlation coefficients in a pairwise fashion doesn't incorporate this constraint. But you're on the right track. My preferred approach to this problem is to find the pairwise correlation matrix as you did and then find the closest positive semidefinite matrix to it using the method of alternating projections. I can't give you the code I wrote for this since it belongs to a customer, but here is the reference I used: http://eprints.ma.man.ac.uk/232/ -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
![](https://secure.gravatar.com/avatar/af6c39d6943bd4b0e1fde23161e7bb8c.jpg?s=120&d=mm&r=g)
Hi Jesper On Fri, May 25, 2007 at 10:37:44AM +0200, Jesper Larsen wrote:
I don't think it is. If my thinking is correct, you would expect the following to have different results: In [38]: x = N.random.random(100) In [39]: y = N.random.random(100) In [40]: N.corrcoef(x,y) Out[40]: array([[ 1. , -0.07291798], [-0.07291798, 1. ]]) In [41]: x_ = N.ma.masked_array(x,mask=(N.random.random(100)>0.5).astype(bool)) In [42]: y_ = N.ma.masked_array(y,mask=(N.random.random(100)>0.5).astype(bool)) In [43]: N.corrcoef(x_,y_) Out[43]: array([[ 1. , -0.07291798], [-0.07291798, 1. ]]) Regards Stéfan
participants (3)
-
Jesper Larsen
-
Robert Kern
-
Stefan van der Walt