Re: [Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef
Thanks Josef and Nathaniel for your responses.

In my application, I don't use the correlation coefficient matrix as a whole (so I don't care whether it is PSD or not). I simply read the off-diagonal elements as pairwise correlation coefficients, which I use to test whether the data from various sources (i.e., rows of the matrix) agree with each other when present. Right now, I call ma.corrcoef(x[i, :], x[j, :]) and read the off-diagonal element in a loop over i and j. It is just a bit uglier than calling ma.corrcoef(x).

At least for my application, truncating to -1 or +1 (or rescaling so that the largest value becomes 1, etc.) is completely wrong, since it would imply that the two sources agree with each other completely (up to a minus sign), which may not be the case. For example, consider the first and last rows of the example I provided:
print x_ma
[[ 7 -4 -1 -- -3 -2]
 [ 6 -3  0  4  0  5]
 [-4 --  7  5 -- --]
 [--  5 --  0  1  4]]

np.ma.corrcoef(x_ma)[0,3]
-1.6813456149534147
On the other hand, if we supply only the first and last rows to the function, we get:
np.ma.corrcoef(x_ma[0,:], x_ma[3,:])
masked_array(data =
 [[1.0 -0.240192230708]
 [-0.240192230708 1.0]],
             mask =
 [[False False]
 [False False]],
       fill_value = 1e+20)
Interestingly, this matches what pandas reports as the [3,0] element of its correlation coefficient matrix, and it equals the pairwise-deletion result:
np.corrcoef([-4, -3, -2], [5, 1, 4]) #Note that this is NOT np.ma.corrcoef
array([[ 1.        , -0.24019223],
       [-0.24019223,  1.        ]])
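To make "pairwise deletion" concrete, here is a minimal sketch that applies it to every pair of rows, so that each entry depends only on the two rows involved. The helper name pairwise_corrcoef and its min_samples threshold are hypothetical, not part of NumPy. The values stored under the mask are placeholders, since the post only shows them as "--".

```python
import numpy as np

# Example matrix from this thread. Entries shown as "--" are masked; their
# underlying values are not given, so 0 is used as a placeholder (assumption).
data = np.array([[ 7, -4, -1,  0, -3, -2],
                 [ 6, -3,  0,  4,  0,  5],
                 [-4,  0,  7,  5,  0,  0],
                 [ 0,  5,  0,  0,  1,  4]], dtype=float)
mask = np.array([[0, 0, 0, 1, 0, 0],
                 [0, 0, 0, 0, 0, 0],
                 [0, 1, 0, 0, 1, 1],
                 [1, 0, 1, 0, 0, 0]], dtype=bool)
x_ma = np.ma.masked_array(data, mask=mask)


def pairwise_corrcoef(x, min_samples=2):
    """Correlation matrix of the rows of x using pairwise deletion.

    Hypothetical helper: for each pair of rows, only the columns where
    both rows are unmasked are used. Pairs with fewer than min_samples
    common columns, or with constant data, are left masked.
    """
    values = np.ma.getdata(x).astype(float)
    valid = ~np.ma.getmaskarray(x)
    n = values.shape[0]
    out = np.ma.masked_all((n, n), dtype=float)
    for i in range(n):
        for j in range(i, n):
            both = valid[i] & valid[j]
            if both.sum() < min_samples:
                continue  # too few common samples, leave masked
            a, b = values[i, both], values[j, both]
            if a.std() == 0 or b.std() == 0:
                continue  # correlation is undefined for constant data
            out[i, j] = out[j, i] = np.corrcoef(a, b)[0, 1]
    return out


R = pairwise_corrcoef(x_ma)
print(R[0, 3])  # approximately -0.2402, same as np.corrcoef([-4, -3, -2], [5, 1, 4])
```

With this approach, pairwise_corrcoef(x_ma)[0, 3] and pairwise_corrcoef(x_ma[[0, 3]])[0, 1] agree by construction. The min_samples default of 2 is only one possible choice; a stricter threshold would mask more entries.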
Also, I don't know why the ma.corrcoef results Josef mentioned are different from mine. In particular, Josef reports element [2, 0] of the ma.corrcoef result to be -1.19, but I get -- (i.e., missing and masked, probably because too few samples are available). Josef: are you sure that you entered the example values correctly into Python? Along the same lines, the results that Nathaniel posted from R are different, I guess because the input is not a masked matrix (please note that in the original example, I had masked values less than or equal to -5).

In any case, I think the correlation coefficient between two rows of a matrix should not depend on what other rows are supplied. In other words, np.ma.corrcoef(x_ma)[0,3] should be equal to np.ma.corrcoef(x_ma[0,:], x_ma[3,:])[0,1] (which apparently happens to be what pandas reports). This change would require recomputing the means for every pairwise coefficient calculation, but since we are already computing cross products O(n^2) times, the overall big-O complexity won't change.

And please don't remove this functionality. I will volunteer to fix it however we decide :) We can just clarify the behavior in the documentation.

Thanks,
Faraz