[Numpy-discussion] corrcoef of masked array

Wed May 30 13:48:22 EDT 2007

Jesper Larsen wrote:

> Here is my solution for calculating the correlation coefficients for masked 
> arrays. Comments are appreciated:
> 
> import numpy as np
> import numpy.ma as ma
> 
> def macorrcoef(data1, data2):
>   """
>   Calculates correlation coefficients taking masked out values
>   into account.
> 
>   It is assumed (but not checked) that data1.shape == data2.shape.
>   """
>   nv, no = data1.shape
>   # Start fully masked; an entry is unmasked when it is assigned below.
>   cc = ma.masked_all((nv, nv), dtype=float)
>   if no > 1:
>     for i in range(nv):
>       for j in range(nv):
>         # Keep only the observations that are valid in both rows.
>         m = ma.getmaskarray(data1[i,:]) | ma.getmaskarray(data2[j,:])
>         d1 = ma.array(data1[i,:], copy=False, mask=m).compressed()
>         d2 = ma.array(data2[j,:], copy=False, mask=m).compressed()
>         # A correlation needs at least two paired observations.
>         if d1.size > 1:
>           c = np.corrcoef(d1, d2)
>           cc[i,j] = c[0,1]
> 
>   return cc

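I'm afraid this doesn't work, either. Correlation matrices are constrained to be
positive semidefinite; that is, all of their eigenvalues must be >= 0.
Calculating each of the correlation coefficients in a pairwise fashion doesn't
incorporate this constraint: each entry is estimated from a different subset of
the observations, so the assembled matrix can end up with negative eigenvalues.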
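
To make the constraint concrete, here is a minimal sketch of an eigenvalue
check, assuming NumPy is imported as np. The helper name min_eigenvalue and the
3x3 matrix are illustrative assumptions: the matrix is hand-picked, not computed
from any data set, but it has a unit diagonal and entries in [-1, 1], so it
looks like a correlation matrix while not being positive semidefinite.

```python
import numpy as np

def min_eigenvalue(c):
    """Return the smallest eigenvalue of a symmetric matrix.

    A negative value means the matrix is not positive semidefinite.
    """
    return float(np.linalg.eigvalsh(np.asarray(c, dtype=float)).min())

# Hand-picked illustration (hypothetical values, not derived from data).
# Analytically its eigenvalues are 1.9, 1.9 and -0.8.
c = np.array([[ 1.0, 0.9, -0.9],
              [ 0.9, 1.0,  0.9],
              [-0.9, 0.9,  1.0]])

is_psd = min_eigenvalue(c) >= 0.0  # False for this matrix
```

The same check can be applied to the output of a pairwise routine once any
masked entries have been dealt with; a small negative tolerance is usually
sensible to allow for rounding.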

But you're on the right track. My preferred approach to this problem is to find
the pairwise correlation matrix as you did and then find the closest positive
semidefinite matrix to it using the method of alternating projections. I can't
give you the code I wrote for this since it belongs to a customer, but here is
the reference I used:

  http://eprints.ma.man.ac.uk/232/
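
For readers who want a starting point, the following is a simplified sketch of
alternating projections with Dykstra's correction, assuming NumPy, an
unweighted Frobenius norm, and an input with no masked or missing entries. It
is not the customer code mentioned above, and it should not be taken as a
faithful transcription of the referenced paper; the function names, the
tolerance and the iteration limit are all assumptions made for illustration.

```python
import numpy as np

def project_psd(a):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    w, v = np.linalg.eigh(a)
    # Scale each eigenvector column by its clipped eigenvalue, then recombine.
    return (v * np.clip(w, 0.0, None)) @ v.T

def project_unit_diagonal(a):
    """Project onto the set of matrices with ones on the diagonal."""
    out = a.copy()
    np.fill_diagonal(out, 1.0)
    return out

def nearest_correlation(c, max_iter=1000, tol=1e-8):
    """Alternate the two projections, with Dykstra's correction on the PSD step.

    Returns a symmetric matrix with unit diagonal; its smallest eigenvalue
    may be negative by an amount on the order of the tolerance.
    """
    y = np.asarray(c, dtype=float)
    y = (y + y.T) / 2.0              # enforce symmetry up front
    correction = np.zeros_like(y)    # Dykstra correction for the PSD projection
    for _ in range(max_iter):
        r = y - correction
        x = project_psd(r)
        correction = x - r
        y_new = project_unit_diagonal(x)
        change = np.linalg.norm(y_new - y, "fro")
        y = y_new
        if change <= tol * max(1.0, np.linalg.norm(y, "fro")):
            break
    return y

# Hand-picked, non-PSD input (hypothetical values, not derived from data).
c = np.array([[ 1.0, 0.9, -0.9],
              [ 0.9, 1.0,  0.9],
              [-0.9, 0.9,  1.0]])
fixed = nearest_correlation(c)
```

The last projection applied is the unit-diagonal one, so the result has exact
ones on the diagonal and is positive semidefinite only up to the tolerance. If
strict semidefiniteness matters downstream, checking the smallest eigenvalue of
the result is a cheap safeguard.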

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco