Hi numpy users,
I have a masked array of dimension (nvariables, nobservations) that contain missing values at arbitrary points. Is it safe to rely on numpy.corrcoeff to calculate the correlation coefficients of a masked array (it seems to give reasonable results)?
Cheers, Jesper
Jesper Larsen wrote:
Hi numpy users,
I have a masked array of dimension (nvariables, nobservations) that contain missing values at arbitrary points. Is it safe to rely on numpy.corrcoeff to calculate the correlation coefficients of a masked array (it seems to give reasonable results)?
No, it isn't. There are several different options for estimating correlations in the face of missing data, none of which are clearly superior to the others. None of them are trivial. None of them are implemented in numpy.
Hi Jesper
On Fri, May 25, 2007 at 10:37:44AM +0200, Jesper Larsen wrote:
I have a masked array of dimension (nvariables, nobservations) that contain missing values at arbitrary points. Is it safe to rely on numpy.corrcoeff to calculate the correlation coefficients of a masked array (it seems to give reasonable results)?
I don't think it is. If my thinking is correct, you would expect the following to have different results:
In [38]: x = N.random.random(100)
In [39]: y = N.random.random(100)
In [40]: N.corrcoef(x,y) Out[40]: array([[ 1. , -0.07291798], [-0.07291798, 1. ]])
In [41]: x_ = N.ma.masked_array(x,mask=(N.random.random(100)>0.5).astype(bool))
In [42]: y_ = N.ma.masked_array(y,mask=(N.random.random(100)>0.5).astype(bool))
In [43]: N.corrcoef(x_,y_) Out[43]: array([[ 1. , -0.07291798], [-0.07291798, 1. ]])
Regards StÃ©fan
On Friday 25 May 2007 19:18, Robert Kern wrote:
Jesper Larsen wrote:
Hi numpy users,
I have a masked array of dimension (nvariables, nobservations) that contain missing values at arbitrary points. Is it safe to rely on numpy.corrcoeff to calculate the correlation coefficients of a masked array (it seems to give reasonable results)?
No, it isn't. There are several different options for estimating correlations in the face of missing data, none of which are clearly superior to the others. None of them are trivial. None of them are implemented in numpy.
Thanks, my previous post was sent a bit too early since it became clear to me by reading the code for corrcoef that it is not safe for use with masked arrays.
Here is my solution for calculating the correlation coefficients for masked arrays. Comments are appreciated:
def macorrcoef(data1, data2): """ Calculates correlation coefficients taking masked out values into account.
It is assumed (but not checked) that data1.shape == data2.shape. """ nv, no = data1.shape cc = ma.array(0., mask=ones((nv, nv))) if no > 1: for i in range(nv): for j in range(nv): m = ma.getmaskarray(data1[i,:]) | ma.getmaskarray(data2[j,:]) d1 = ma.array(data1[i,:], copy=False, mask=m).compressed() d2 = ma.array(data2[j,:], copy=False, mask=m).compressed() if ma.count(d1) > 1: c = corrcoef(d1, d2) cc[i,j] = c[0,1]
return cc
- Jesper
Jesper Larsen wrote:
Here is my solution for calculating the correlation coefficients for masked arrays. Comments are appreciated:
def macorrcoef(data1, data2): """ Calculates correlation coefficients taking masked out values into account.
It is assumed (but not checked) that data1.shape == data2.shape. """ nv, no = data1.shape cc = ma.array(0., mask=ones((nv, nv))) if no > 1: for i in range(nv): for j in range(nv): m = ma.getmaskarray(data1[i,:]) | ma.getmaskarray(data2[j,:]) d1 = ma.array(data1[i,:], copy=False, mask=m).compressed() d2 = ma.array(data2[j,:], copy=False, mask=m).compressed() if ma.count(d1) > 1: c = corrcoef(d1, d2) cc[i,j] = c[0,1]
return cc
I'm afraid this doesn't work, either. Correlation matrices are constrained to be positive semidefinite; that is, all of their eigenvalues must be >= 0. Calculating each of the correlation coefficients in a pairwise fashion doesn't incorporate this constraint.
But you're on the right track. My preferred approach to this problem is to find the pairwise correlation matrix as you did and then find the closest positive semidefinite matrix to it using the method of alternating projections. I can't give you the code I wrote for this since it belongs to a customer, but here is the reference I used:
http://eprints.ma.man.ac.uk/232/
On Wednesday 30 May 2007 19:48, Robert Kern wrote:
I'm afraid this doesn't work, either. Correlation matrices are constrained to be positive semidefinite; that is, all of their eigenvalues must be >= 0. Calculating each of the correlation coefficients in a pairwise fashion doesn't incorporate this constraint.
But you're on the right track. My preferred approach to this problem is to find the pairwise correlation matrix as you did and then find the closest positive semidefinite matrix to it using the method of alternating projections. I can't give you the code I wrote for this since it belongs to a customer, but here is the reference I used:
Robert,
Thanks your comments and for the reference. I will try to implement the algorithm sometime this month.
Cheers, Jesper