On Friday 25 May 2007 19:18, Robert Kern wrote:
Jesper Larsen wrote:
Hi numpy users,
I have a masked array of dimension (nvariables, nobservations) that contains missing values at arbitrary points. Is it safe to rely on numpy.corrcoef to calculate the correlation coefficients of a masked array (it seems to give reasonable results)?
No, it isn't. There are several different options for estimating correlations in the face of missing data, none of which are clearly superior to the others. None of them are trivial. None of them are implemented in numpy.
Thanks, my previous post was sent a bit too early since it became clear to me by reading the code for corrcoef that it is not safe for use with masked arrays. Here is my solution for calculating the correlation coefficients for masked arrays. Comments are appreciated:

    def macorrcoef(data1, data2):
        """
        Calculates correlation coefficients taking masked out values
        into account.

        It is assumed (but not checked) that data1.shape == data2.shape.
        """
        nv, no = data1.shape
        cc = ma.array(0., mask=ones((nv, nv)))
        if no > 1:
            for i in range(nv):
                for j in range(nv):
                    m = ma.getmaskarray(data1[i,:]) | ma.getmaskarray(data2[j,:])
                    d1 = ma.array(data1[i,:], copy=False, mask=m).compressed()
                    d2 = ma.array(data2[j,:], copy=False, mask=m).compressed()
                    if ma.count(d1) > 1:
                        c = corrcoef(d1, d2)
                        cc[i,j] = c[0,1]
        return cc

- Jesper
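
For readers who want to try the pairwise idea from the post, here is a minimal, self-contained sketch. It is not the code from the post: the function name pairwise_corrcoef is made up for illustration, and it assumes a NumPy version where the masked-array module is importable as numpy.ma and provides ma.masked_all. For each pair of rows it keeps only the observations that are unmasked in both rows (pairwise deletion, one of several possible ways to handle missing data) and leaves the result masked when fewer than two such observations remain.

```python
import numpy as np
import numpy.ma as ma


def pairwise_corrcoef(data1, data2):
    """Correlate every row of data1 with every row of data2.

    Only observations that are unmasked in BOTH rows are used
    (pairwise deletion). Entries with fewer than two shared
    observations stay masked. The name is hypothetical, not a NumPy API.
    """
    data1 = ma.asarray(data1)
    data2 = ma.asarray(data2)
    if data1.shape != data2.shape:
        raise ValueError("data1 and data2 must have the same shape")

    nv = data1.shape[0]
    # Start fully masked; assigning a value below unmasks that entry.
    cc = ma.masked_all((nv, nv), dtype=float)

    # getmaskarray always returns a full boolean array, even if nothing is masked.
    mask1 = ma.getmaskarray(data1)
    mask2 = ma.getmaskarray(data2)

    for i in range(nv):
        for j in range(nv):
            valid = ~(mask1[i] | mask2[j])  # observed in both rows
            if valid.sum() > 1:
                x = data1.data[i, valid]
                y = data2.data[j, valid]
                cc[i, j] = np.corrcoef(x, y)[0, 1]
    return cc


# Small example: NaN marks a missing observation.
raw = np.array([[1.0, 2.0, 3.0, 4.0],
                [2.0, np.nan, 6.0, 8.0],
                [4.0, 3.0, np.nan, np.nan]])
data = ma.masked_invalid(raw)
print(pairwise_corrcoef(data, data))
```

In the example, rows 1 and 2 share only one observed column, so that entry of the result stays masked. Note that each coefficient may be based on a different subset of observations, so the resulting matrix is not guaranteed to be positive semidefinite.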