[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef

Thu Sep 26 16:59:46 EDT 2013

Thanks Josef and Nathaniel for your responses.

In the application that I have, I don't use the correlation coefficient
matrix as a whole (so I don't care if it is PSD or not). I simply read the
off-diagonal elements for pair-wise correlation coefficients. I use the
pairwise correlation coefficient to test whether the data from various
sources (i.e., rows of the matrix) agree with each other when present.

Right now, I call ma.corrcoef(x[i, :], x[j, :]) in a loop over i and j and
read the off-diagonal element each time. It is just a bit uglier than
calling ma.corrcoef(x).

At least for my application, truncation to -1 or +1 (or scaling such that
the largest value becomes 1, etc.) is completely wrong, since it would imply
that the two sources completely agree with each other (factoring out a
minus sign), which may not be the case. For example, consider the first and
last rows of the example I provided:

>>> print x_ma
[[ 7 -4  -1  --  -3  -2]
 [ 6 -3   0  4   0   5]
 [-4  --   7  5  --   --]
 [--   5   --  0   1  4]]

>>> np.ma.corrcoef(x_ma)[0,3]
-1.6813456149534147

On the other hand, if we supply only the first and last rows to the
function, we get:

>>> np.ma.corrcoef(x_ma[0,:], x_ma[3,:])
masked_array(data =
 [[1.0 -0.240192230708]
 [-0.240192230708 1.0]],
             mask =
 [[False False]
 [False False]],
       fill_value = 1e+20)

Interestingly, this is the same as what pandas returns as the [3,0] element
of the correlation coefficient matrix, and it is equal to the pair-wise
deletion result:

>>> np.corrcoef([-4, -3, -2], [5, 1, 4])  # Note: this is NOT np.ma.corrcoef
array([[ 1.        , -0.24019223],
       [-0.24019223,  1.        ]])
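
As a minimal sketch of what pair-wise deletion means here: keep only the
columns where both rows are unmasked, then compute an ordinary correlation
on what is left. The helper name pairwise_deletion_corr is hypothetical, and
the values stored under the mask are placeholders (0), since only the masked
printout is available in this message.

```python
import numpy as np

def pairwise_deletion_corr(a, b):
    """Correlation of two 1-D masked rows using only columns valid in both.

    Hypothetical helper for illustration; not part of NumPy.
    """
    # True where both rows have an unmasked value
    both = ~(np.ma.getmaskarray(a) | np.ma.getmaskarray(b))
    xa = np.ma.getdata(a)[both].astype(float)
    xb = np.ma.getdata(b)[both].astype(float)
    # Plain (unmasked) correlation on the surviving samples
    return np.corrcoef(xa, xb)[0, 1]

# The example matrix; entries under the mask are placeholder zeros.
data = np.array([[ 7, -4, -1,  0, -3, -2],
                 [ 6, -3,  0,  4,  0,  5],
                 [-4,  0,  7,  5,  0,  0],
                 [ 0,  5,  0,  0,  1,  4]])
mask = np.array([[0, 0, 0, 1, 0, 0],
                 [0, 0, 0, 0, 0, 0],
                 [0, 1, 0, 0, 1, 1],
                 [1, 0, 1, 0, 0, 0]], dtype=bool)
x_ma = np.ma.masked_array(data, mask=mask)

# Rows 0 and 3 share columns 1, 4 and 5: [-4, -3, -2] vs [5, 1, 4]
r = pairwise_deletion_corr(x_ma[0], x_ma[3])
print(r)  # approximately -0.2402
```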

Also, I don't know why the ma.corrcoef results Josef has mentioned are
different from mine. In particular, Josef reports element [2, 0] of the
ma.corrcoef result to be -1.19, but I get -- (i.e., missing and masked,
probably due to too few samples being available). Josef: are you sure that
you have entered the example values correctly into Python? Along the same
lines, the results that Nathaniel has posted from R are different, I guess
because the input there is not a masked matrix (please note that in the
original example, I had masked values less than or equal to -5).

In any case, I think the correlation coefficient between two rows of a
matrix should not depend on what other rows are supplied. In other words,
np.ma.corrcoef(x_ma)[0,3] should be equal to np.ma.corrcoef(x_ma[0,:],
x_ma[3,:])[0,1] (which apparently happens to be what pandas reports).

This change would require recomputing the mean for every pair-wise
coefficient calculation, but since we are already computing cross products
O(n^2) times, the overall big-O complexity won't change.
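
To make the cost argument concrete, here is a rough sketch of a pair-wise
version, not a proposed patch. For each of the O(n^2) row pairs it recomputes
both means over the jointly valid columns and then forms the cross product;
both steps are linear in the number of columns, so the total order of work is
the same as before. The name pairwise_corrcoef and the min_samples cutoff
(default 3) are assumptions made for illustration, and entries under the mask
are placeholder zeros.

```python
import numpy as np

def pairwise_corrcoef(x, min_samples=3):
    """Row-by-row correlation matrix using pair-wise deletion.

    Hypothetical sketch; min_samples is an assumed cutoff below which
    the coefficient is left masked.
    """
    x = np.ma.asarray(x)
    data = np.ma.getdata(x).astype(float)
    valid = ~np.ma.getmaskarray(x)
    n = data.shape[0]
    out = np.ma.masked_all((n, n), dtype=float)
    for i in range(n):
        for j in range(i, n):
            both = valid[i] & valid[j]      # columns present in both rows
            if both.sum() < min_samples:
                continue                    # too few samples: stay masked
            xi, xj = data[i, both], data[j, both]
            di = xi - xi.mean()             # mean recomputed for this pair
            dj = xj - xj.mean()
            denom = np.sqrt((di @ di) * (dj @ dj))
            if denom == 0:
                continue                    # constant row: undefined
            out[i, j] = out[j, i] = (di @ dj) / denom
    return out

data = np.array([[ 7, -4, -1,  0, -3, -2],
                 [ 6, -3,  0,  4,  0,  5],
                 [-4,  0,  7,  5,  0,  0],
                 [ 0,  5,  0,  0,  1,  4]])
mask = np.array([[0, 0, 0, 1, 0, 0],
                 [0, 0, 0, 0, 0, 0],
                 [0, 1, 0, 0, 1, 1],
                 [1, 0, 1, 0, 0, 0]], dtype=bool)
c = pairwise_corrcoef(np.ma.masked_array(data, mask=mask))
print(c[0, 3])  # approximately -0.2402
```

With this approach the [0,3] entry no longer depends on which other rows are
supplied, and every unmasked coefficient stays within [-1, 1].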

And please don't remove this functionality. I will volunteer to fix it
however we decide :) We can just clarify the behavior in the documentation.

Thanks,

Faraz