Thanks Josef and Nathaniel for your responses.
In the application that I have, I don't use the correlation coefficient
matrix as a whole (so I don't care if it is PSD or not). I simply read the
off-diagonal elements for pair-wise correlation coefficients. I use the
pairwise correlation coefficient to test if the data from various sources
(i.e., rows of the matrix), agree with each other when present.
Right now, I call ma.corrcoef(x[i, :], x[j, :]) in a loop over i and j
and read the off-diagonal element of each result. It is just a bit uglier
than calling ma.corrcoef(x).
At least for my application, truncation to -1 or +1 (or scaling such that
the largest value becomes 1, etc.) is completely wrong, since it would imply
that the two sources completely agree with each other (factoring out a
minus sign), which may not be the case. For example, consider the first and
last rows of the example I provided:
>>> print x_ma
[[ 7 -4 -1 -- -3 -2]
[ 6 -3 0 4 0 5]
[-4 -- 7 5 -- --]
[-- 5 -- 0 1 4]]
>>> np.ma.corrcoef(x_ma)[0,3]
-1.6813456149534147
On the other hand, if we supply only the first and last rows to the
function, we get:
>>> np.ma.corrcoef(x_ma[0,:], x_ma[3,:])
masked_array(data =
[[1.0 -0.240192230708]
[-0.240192230708 1.0]],
mask =
[[False False]
[False False]],
fill_value = 1e+20)
Interestingly, this is the same as what pandas reports as the [3,0] element
of the correlation coefficient matrix, and it is equal to the
pairwise-deletion result:
>>> np.corrcoef([-4, -3, -2], [5, 1, 4])  # Note: this is NOT np.ma.corrcoef
array([[ 1.        , -0.24019223],
       [-0.24019223,  1.        ]])
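For reference, here is a minimal sketch of how the two input lists above
can be derived from the masks of rows 0 and 3. The zeros stored under the
masked positions are placeholders I made up (the original values there do
not matter), and the variable names are only for illustration:

```python
import numpy as np

# Rows 0 and 3 of x_ma; masked entries hold a placeholder value of 0.
row0 = np.ma.masked_array([7, -4, -1, 0, -3, -2], mask=[0, 0, 0, 1, 0, 0])
row3 = np.ma.masked_array([0, 5, 0, 0, 1, 4], mask=[1, 0, 1, 0, 0, 0])

# Keep only the columns where BOTH rows are observed (pairwise deletion).
both = ~(np.ma.getmaskarray(row0) | np.ma.getmaskarray(row3))
a = row0.data[both]  # jointly observed values from row 0
b = row3.data[both]  # jointly observed values from row 3

# Plain (unmasked) correlation on the jointly observed values.
r = np.corrcoef(a, b)[0, 1]
print(a, b, r)
```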
Also, I don't know why the ma.corrcoef results Josef has mentioned are
different from mine. In particular, Josef reports element [2, 0] of the
ma.corrcoef result to be -1.19, but I get -- (i.e., missing and masked,
probably because too few samples are available). Josef: are you sure that
you have entered the example values correctly into Python? Along the same
lines, the results that Nathaniel has posted from R are different,
presumably because the input is not a masked matrix (please note that in
the original example, I had masked values less than or equal to -5).
In any case, I think the correlation coefficient between two rows of a
matrix should not depend on what other rows are supplied. In other words,
np.ma.corrcoef(x_ma)[0,3] should be equal to np.ma.corrcoef(x_ma[0,:],
x_ma[3,:])[0,1] (which apparently happens to be what pandas reports).
This change would require recomputing the means for every pairwise
coefficient, but since we already compute O(n^2) cross products (one per
pair of rows), the overall big-O complexity would not change.
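To make the proposal concrete, here is a rough sketch of the behavior I
have in mind. It is not the current np.ma.corrcoef implementation; the
function name pairwise_corrcoef and the min_samples parameter are
hypothetical, and the zeros under the mask are placeholders. Each
coefficient uses only the columns observed in both rows, so the means are
recomputed per pair:

```python
import numpy as np

def pairwise_corrcoef(x, min_samples=2):
    """Row-by-row correlation using pairwise deletion (illustrative only)."""
    valid = ~np.ma.getmaskarray(x)
    data = np.ma.getdata(x).astype(float)
    n = data.shape[0]
    out = np.ma.masked_all((n, n))  # entries stay masked unless computed
    for i in range(n):
        for j in range(i, n):
            both = valid[i] & valid[j]  # columns observed in both rows
            if both.sum() < min_samples:
                continue  # too few joint samples: leave masked
            xi, xj = data[i, both], data[j, both]
            if xi.std() == 0 or xj.std() == 0:
                continue  # zero variance: correlation undefined
            out[i, j] = out[j, i] = np.corrcoef(xi, xj)[0, 1]
    return out

# The example matrix; masked positions hold a placeholder value of 0.
values = np.array([[ 7, -4, -1, 0, -3, -2],
                   [ 6, -3,  0, 4,  0,  5],
                   [-4,  0,  7, 5,  0,  0],
                   [ 0,  5,  0, 0,  1,  4]])
mask = np.array([[0, 0, 0, 1, 0, 0],
                 [0, 0, 0, 0, 0, 0],
                 [0, 1, 0, 0, 1, 1],
                 [1, 0, 1, 0, 0, 0]], dtype=bool)
x_ma = np.ma.masked_array(values, mask=mask)

result = pairwise_corrcoef(x_ma)
print(result[0, 3])
```

With this approach, result[0, 3] does not depend on which other rows are
in the matrix, and every computed entry stays within [-1, 1]. Whether two
joint samples should be enough (they always give +1 or -1) is a separate
question; raising min_samples would mask those entries instead.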
And please don't remove this functionality. I will volunteer to fix it
however we decide :) We can just clarify the behavior in the documentation.
Thanks,
Faraz