[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef

Wed Sep 25 20:26:34 EDT 2013

Hi everyone,

I'm using np.ma.corrcoef to compute the correlation coefficients between rows
of a masked matrix, where the masked elements represent missing data. I've
observed that in some cases np.ma.corrcoef returns invalid coefficients
that are greater than 1 or less than -1.

Here's an example:

import numpy as np

x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])

# Mask every entry <= -5 to stand in for missing data.
x_ma = np.ma.masked_less_equal(x, -5)

C = np.round(np.ma.corrcoef(x_ma), 2)

print(C)

[[1.0   0.73  --    -1.68]
 [0.73  1.0   -0.86 -0.38]
 [--    -0.86 1.0   --]
 [-1.68 -0.38 --    1.0]]

As you can see, the [0,3] element is -1.68, which is not a valid correlation
coefficient (valid correlation coefficients lie between -1 and 1).

I looked at the code for np.ma.corrcoef, and this behavior seems to be due
to the way the row means of the input matrix are computed and subtracted.
Apparently, each row's mean is computed once, over that row's own unmasked
elements, without also excluding the elements that are masked in the other
row of the pair for which the correlation coefficient is being computed.
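
To illustrate the difference between the two means, here is a small sketch for rows 0 and 3 of the example. It does not reproduce the internals of np.ma.corrcoef; it only compares each row's mean over its own unmasked elements with the mean over the elements that are unmasked in both rows. The variable names (both_valid, mean0_own, and so on) are mine.

```python
import numpy as np

x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(x, -5)

row0, row3 = x_ma[0], x_ma[3]

# Columns that are unmasked in BOTH rows (pairwise-propagated mask).
both_valid = ~(np.ma.getmaskarray(row0) | np.ma.getmaskarray(row3))

# Mean of each row over its own unmasked elements only.
mean0_own = row0.mean()
mean3_own = row3.mean()

# Mean of each row over the columns that are valid in both rows.
mean0_pair = row0.data[both_valid].mean()
mean3_pair = row3.data[both_valid].mean()

print(mean0_own, mean0_pair)
print(mean3_own, mean3_pair)
```

The two means differ for both rows, so centering with the per-row mean and then combining only the pairwise-valid elements mixes two different samples.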

I guess the right way would be to propagate the masks pairwise first, and
then recompute the mean of each row every time a correlation coefficient is
computed for a pair of rows.
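
A minimal sketch of what I mean, assuming a plain double loop over row pairs is acceptable. The function name pairwise_masked_corrcoef is hypothetical, not an existing NumPy function, and treating pairs with fewer than two common valid columns (or zero variance) as masked is my own choice.

```python
import numpy as np

def pairwise_masked_corrcoef(x_ma):
    """Correlate rows using only the columns that are unmasked in both rows."""
    data = np.ma.getdata(x_ma).astype(float)
    mask = np.ma.getmaskarray(x_ma)
    n = data.shape[0]
    out = np.ma.masked_all((n, n), dtype=float)
    for i in range(n):
        for j in range(i, n):
            # Propagate the masks pairwise.
            valid = ~(mask[i] | mask[j])
            if valid.sum() < 2:
                continue  # not enough common data; leave masked
            # Recompute the means on the common columns only.
            a = data[i, valid] - data[i, valid].mean()
            b = data[j, valid] - data[j, valid].mean()
            denom = np.sqrt((a * a).sum() * (b * b).sum())
            if denom == 0:
                continue  # a constant row has no defined correlation
            out[i, j] = out[j, i] = (a * b).sum() / denom
    return out

x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(x, -5)

print(np.round(pairwise_masked_corrcoef(x_ma), 2))
```

Because the numerator and the denominator are computed from the same pairwise-valid sample, every unmasked coefficient stays within [-1, 1]. Note that a pair with only two common columns always gives exactly +1 or -1.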

Please let me know what you think.

Thanks,

Faraz