On Thu, Sep 26, 2013 at 6:42 PM, Nathaniel Smith wrote:
On 26 Sep 2013 21:59, "Faraz Mirzaei" wrote:
Thanks Josef and Nathaniel for your responses.
In the application that I have, I don't use the correlation coefficient matrix as a whole (so I don't care if it is PSD or not). I simply read the off-diagonal elements to get pairwise correlation coefficients. I use the pairwise correlation coefficient to test whether the data from various sources (i.e., rows of the matrix) agree with each other when present.
Right now I use ma.corrcoef(x[i, :], x[j, :]) and read the off-diagonal element in a loop over i and j. It is just a bit uglier than calling ma.corrcoef(x).
At least for my application, truncation to -1 or +1 (or scaling such that the largest value becomes 1, etc.) is completely wrong, since it would imply that the two sources completely agree with each other (factoring out a minus sign), which may not be the case. For example, consider the first and last rows of the example I provided:
print x_ma
[[ 7 -4 -1 -- -3 -2]
 [ 6 -3  0  4  0  5]
 [-4 --  7  5 -- --]
 [--  5 --  0  1  4]]

np.ma.corrcoef(x_ma)[0,3]
-1.6813456149534147
On the other hand, if we supply only the first and last rows to the function, we get:
np.ma.corrcoef(x_ma[0,:], x_ma[3,:])
masked_array(data =
 [[1.0 -0.240192230708]
 [-0.240192230708 1.0]],
             mask =
 [[False False]
 [False False]],
       fill_value = 1e+20)
Interestingly, this is the same as what pandas gives for the [3,0] element of the correlation coefficient matrix, and it is equal to the pairwise deletion result:
np.corrcoef([-4, -3, -2], [5, 1, 4]) #Note that this is NOT np.ma.corrcoef
array([[ 1. , -0.24019223], [-0.24019223, 1. ]])
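To spell out where those three-element lists come from, here is a minimal sketch of the pairwise-deletion step for rows 0 and 3. The masks are copied from the printout of x_ma; the values stored under the mask are not shown anywhere in this thread, so the zeros below are placeholders (an assumption) and are never read.

```python
import numpy as np

# Rows 0 and 3 of x_ma. Masked slots hold a placeholder 0 because the
# underlying values are not given in the thread.
row0 = np.ma.masked_array([7, -4, -1, 0, -3, -2], mask=[0, 0, 0, 1, 0, 0])
row3 = np.ma.masked_array([0, 5, 0, 0, 1, 4], mask=[1, 0, 1, 0, 0, 0])

# Pairwise deletion: keep only the columns observed in BOTH rows.
both_valid = ~(np.ma.getmaskarray(row0) | np.ma.getmaskarray(row3))
a = row0.data[both_valid]
b = row3.data[both_valid]

# Plain (unmasked) correlation on the surviving samples.
r = np.corrcoef(a, b)[0, 1]
print(a, b, r)
```

Only columns 1, 4 and 5 survive, which is why the plain np.corrcoef call above is given [-4, -3, -2] and [5, 1, 4].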
Also, I don't know why the ma.corrcoef results Josef has mentioned are different than mine. In particular, Josef reports element [2, 0] of the ma.corrcoef result to be -1.19, but I get -- (i.e., missing and masked, probably due to too few samples available). Josef: are you sure that you have entered the example values correctly into python? Along the same lines, the results that Nathaniel has posted from R are different, since the input is not a masked matrix I guess (please note that in the original example, I had masked values less than or equal to -5).
Yes, sorry, this is just a cut and paste error - in fact the result I posted is what R gives for the data with values <= -5 replaced by NA, but I left this line out of the email.
I think the only difference is that R and pandas give a correlation of 1.0 when there are only 1 or 2 data points, and ma.corrcoef returns masked in this case. Not sure which makes more sense.
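For reference on the two-point case: with exactly two shared samples the Pearson coefficient is always +1 or -1 (as long as neither variable is constant), so the number says nothing about agreement. A small sketch using plain np.corrcoef on the two columns that rows 0 and 2 of the example share (columns 0 and 2); the variable names are only for illustration:

```python
import numpy as np

# Rows 0 and 2 of x_ma overlap only in columns 0 and 2.
a = np.array([7.0, -1.0])   # row 0 at the shared columns
b = np.array([-4.0, 7.0])   # row 2 at the shared columns

# Two points always lie exactly on a line, so |r| is 1 up to rounding.
r_two = np.corrcoef(a, b)[0, 1]
print(r_two)
```

Whether such a value should be reported or masked is the open question here; the sketch only shows why it is uninformative.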
In any case, I think the correlation coefficient between two rows of a matrix should not depend on what other rows are supplied. In other words, np.ma.corrcoef(x_ma)[0,3] should be equal to np.ma.corrcoef(x_ma[0,:], x_ma[3,:])[0,1] (which apparently happens to be what pandas reports).
This change would require recomputing the mean for every pairwise coefficient, but since we are already computing cross products O(n^2) times, the overall big-O complexity won't change.
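A rough sketch of what per-pair recomputation could look like, not a proposed patch. The function name pairwise_corrcoef and the min_periods threshold are hypothetical (min_periods=3 is chosen only to mimic the "masked when too few samples" behaviour mentioned in this thread), and the zeros stored under the mask are placeholders because the original values are not shown.

```python
import numpy as np

def pairwise_corrcoef(x, min_periods=3):
    """Hypothetical helper: row-by-row correlation with pairwise deletion."""
    x = np.ma.asarray(x, dtype=float)
    valid = ~np.ma.getmaskarray(x)
    data = x.filled(0.0)  # fill value is irrelevant, masked slots are skipped
    n_rows = x.shape[0]
    out = np.ma.masked_all((n_rows, n_rows), dtype=float)
    for i in range(n_rows):
        for j in range(i, n_rows):
            both = valid[i] & valid[j]
            if both.sum() < min_periods:
                continue  # too few shared samples: leave the entry masked
            a = data[i, both]
            b = data[j, both]
            # Means are recomputed from this pair's shared samples only.
            da = a - a.mean()
            db = b - b.mean()
            denom = np.sqrt((da * da).sum() * (db * db).sum())
            if denom == 0:
                continue  # a row is constant over the shared samples
            out[i, j] = out[j, i] = (da * db).sum() / denom
    return out

# x_ma from the message, with placeholder zeros under the mask.
x_ma = np.ma.masked_array(
    [[7, -4, -1, 0, -3, -2],
     [6, -3, 0, 4, 0, 5],
     [-4, 0, 7, 5, 0, 0],
     [0, 5, 0, 0, 1, 4]],
    mask=[[0, 0, 0, 1, 0, 0],
          [0, 0, 0, 0, 0, 0],
          [0, 1, 0, 0, 1, 1],
          [1, 0, 1, 0, 0, 0]],
)

c = pairwise_corrcoef(x_ma)
print(c[0, 3])                                 # full matrix
print(pairwise_corrcoef(x_ma[[0, 3]])[0, 1])   # only rows 0 and 3
```

Because each entry uses only the samples shared by its own two rows, the [0, 3] element comes out the same whether or not the other rows are supplied, and it stays within [-1, 1].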
And please don't remove this functionality. I will volunteer to fix it however we decide :) We can just clarify the behavior in the documentation.
In the long run I prefer R's behaviour of requiring the user to specify before skipping anything, but I tend to agree that in the short term pairwise deletion is what ma.corrcoef users expect and what we should do. Maybe you could implement the fix and we could move the discussion to the PR?
pandas has a cython function in algos that loops over all pairs and calculates mean, cross product and standard deviation for each pair separately. I agree that that would be the best choice for pairwise deletion for np.ma.corrcoef and cov.

Josef
-n
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion