Re: [Numpy-discussion] corrcoef of masked array

30 May 2007

On Friday 25 May 2007 19:18, Robert Kern wrote:
Jesper Larsen wrote:

Hi numpy users,
I have a masked array of dimension (nvariables, nobservations) that
contains missing values at arbitrary points. Is it safe to rely on
numpy.corrcoef to calculate the correlation coefficients of a masked
array (it seems to give reasonable results)?

No, it isn't. There are several different options for estimating
correlations in the face of missing data, none of which are clearly
superior to the others. None of them are trivial. None of them are
implemented in numpy.

Thanks. My previous post was sent a bit too early: reading the code for
corrcoef made it clear to me that it is not safe for use with masked
arrays.

Here is my solution for calculating the correlation coefficients for masked 
arrays. Comments are appreciated:

import numpy as np
import numpy.ma as ma

def macorrcoef(data1, data2):
  """
  Calculates correlation coefficients taking masked out values
  into account.

  It is assumed (but not checked) that data1.shape == data2.shape.
  """
  nv, no = data1.shape
  # Start fully masked; an entry is unmasked only when it is assigned below.
  cc = ma.masked_all((nv, nv))
  if no > 1:
    for i in range(nv):
      for j in range(nv):
        # Keep only the observations that are valid in both rows.
        m = ma.getmaskarray(data1[i,:]) | ma.getmaskarray(data2[j,:])
        d1 = ma.array(data1[i,:], copy=False, mask=m).compressed()
        d2 = ma.array(data2[j,:], copy=False, mask=m).compressed()
        # compressed() returns a plain ndarray, so check its size directly.
        if d1.size > 1:
          c = np.corrcoef(d1, d2)
          cc[i,j] = c[0,1]

  return cc
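
For comparison, one of the simpler alternatives to the pairwise approach
above is listwise deletion: drop every observation in which any variable is
masked, then call numpy.corrcoef on what is left. The following is only a
minimal sketch under that assumption. The name listwise_corrcoef and the
sample data are made up for illustration and are not part of numpy.

```python
import numpy as np
import numpy.ma as ma

def listwise_corrcoef(data):
    """Correlation matrix from observations that are valid for every variable.

    `data` has shape (nvariables, nobservations). This helper is a
    hypothetical illustration, not a numpy function.
    """
    data = ma.asarray(data)
    nv = data.shape[0]
    # An observation (column) is kept only if no variable is masked there.
    complete = ~ma.getmaskarray(data).any(axis=0)
    if complete.sum() < 2:
        # Too few complete observations: return a fully masked result.
        return ma.masked_all((nv, nv))
    return ma.array(np.corrcoef(ma.getdata(data)[:, complete]))

# Illustrative data: the last observation of the second variable is missing.
x = ma.array([[1.0, 2.0, 3.0, 4.0, 5.0],
              [2.0, 4.0, 6.0, 8.0, 0.0],
              [5.0, 3.0, 4.0, 1.0, 2.0]],
             mask=[[0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0]])
r = listwise_corrcoef(x)
```

Listwise deletion uses the same set of observations for every pair of
variables, but it can throw away a lot of data when the missing values are
scattered. The pairwise approach keeps more data for each pair, at the cost
of every coefficient being based on a different subset of observations.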

- Jesper