<div dir="ltr">Thanks Josef and Nathaniel for your responses.<div><br></div><div>In my application, I don't use the correlation coefficient matrix as a whole (so I don't care whether it is PSD or not). I simply read the off-diagonal elements as pair-wise correlation coefficients, and use them to test whether the data from different sources (i.e., rows of the matrix) agree with each other where both are present.</div>
<div><br></div><div>Right now, I call ma.corrcoef(x[i, :], x[j, :]) in a loop over i and j and read the off-diagonal element of each result. It is just a bit uglier than calling ma.corrcoef(x).</div><div><div class="gmail_extra" style>
<br>At least for my application, truncation to -1 or +1 (or scaling such that the largest value becomes 1, etc.) is completely wrong, since it would imply that the two sources completely agree with each other (up to a minus sign), which may not be the case. For example, consider the first and last rows of the example I provided:</div>
</div><div class="gmail_extra" style><br></div><div class="gmail_extra" style><div class="gmail_extra" style><div class="gmail_extra">>>> print x_ma</div><div class="gmail_extra">[[ 7 -4 -1 -- -3 -2]</div><div class="gmail_extra">
[ 6 -3 0 4 0 5]</div><div class="gmail_extra"> [-4 -- 7 5 -- --]</div><div class="gmail_extra"> [-- 5 -- 0 1 4]]</div><div class="gmail_extra"><br></div></div></div><div class="gmail_extra" style><div class="gmail_extra">
<font face="arial, sans-serif">>>> np.ma.corrcoef(x_ma)[0,3]</font></div><div class="gmail_extra"><font face="arial, sans-serif">-1.6813456149534147</font></div><div class="gmail_extra"><font face="arial, sans-serif"><br>
</font></div></div><div class="gmail_extra" style><br></div><div class="gmail_extra" style>On the other hand, if we supply only the first and last rows to the function, we get:</div><div class="gmail_extra" style><br></div>
<div class="gmail_extra" style>>>> np.ma.corrcoef(x_ma[0,:], x_ma[3,:])</div><div class="gmail_extra"><div class="gmail_extra">masked_array(data =</div><div class="gmail_extra"> [[1.0 -0.240192230708]</div><div class="gmail_extra">
[-0.240192230708 1.0]],</div><div class="gmail_extra"> mask =</div><div class="gmail_extra"> [[False False]</div><div class="gmail_extra"> [False False]],</div><div class="gmail_extra"> fill_value = 1e+20)</div>
<div class="gmail_extra"><br></div><div class="gmail_extra" style>Interestingly, this is the same as what pandas returns for the [3,0] element of the correlation coefficient matrix, and it is equal to the pair-wise deletion result:</div>
<div class="gmail_extra" style><br></div><div class="gmail_extra" style><div class="gmail_extra">>>> np.corrcoef([-4, -3, -2], [5, 1, 4])  # Note that this is NOT np.ma.corrcoef</div>
<div class="gmail_extra">array([[ 1. , -0.24019223],</div><div class="gmail_extra"> [-0.24019223, 1. ]])</div></div></div><div class="gmail_extra" style><br></div><div class="gmail_extra" style><br></div>
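<div class="gmail_extra" style>To make the pair-wise deletion explicit, here is a minimal sketch for the first and last rows. It assumes the two rows are rebuilt by hand from the printed x_ma, with a placeholder 0 under each masked entry (the real underlying values are not shown above); the names row0, row3 and both are only for illustration.</div>
<pre>
```python
import numpy as np

# Rows 0 and 3 of the printed x_ma. Masked entries get a placeholder 0,
# since the underlying values are not shown in the printout.
row0 = np.ma.array([7, -4, -1, 0, -3, -2], mask=[0, 0, 0, 1, 0, 0])
row3 = np.ma.array([0, 5, 0, 0, 1, 4], mask=[1, 0, 1, 0, 0, 0])

# Pair-wise deletion: keep only the columns observed in BOTH rows.
both = ~(np.ma.getmaskarray(row0) | np.ma.getmaskarray(row3))
a = np.ma.getdata(row0)[both]   # the values -4, -3, -2
b = np.ma.getdata(row3)[both]   # the values 5, 1, 4

# Plain (unmasked) correlation on the surviving samples.
r = np.corrcoef(a, b)[0, 1]
print(r)
```
</pre>
<div class="gmail_extra" style>The surviving samples are exactly the two short lists passed to np.corrcoef above, so r is the same -0.2402 value.</div>
<div class="gmail_extra" style><br></div>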
<div class="gmail_extra" style>Also, I don't know why the ma.corrcoef results Josef mentioned differ from mine. In particular, Josef reports element [2, 0] of the ma.corrcoef result to be -1.19, but I get -- (i.e., missing and masked, probably because too few samples are available). Josef: are you sure that you entered the example values correctly into Python? Along the same lines, the results that Nathaniel posted from R are different, I guess because the input was not a masked matrix (please note that in the original example, I had masked values less than or equal to -5).</div>
<div class="gmail_extra" style><br></div><div class="gmail_extra" style><br></div><div class="gmail_extra" style>In any case, I think the correlation coefficient between two rows of a matrix should not depend on what other rows are supplied. In other words, <span style="font-family:arial,sans-serif">np.ma.corrcoef(x_ma)[0,3] </span>should be equal to np.ma.corrcoef(x_ma[0,:], x_ma[3,:])[0,1]<span style="font-family:arial,sans-serif"> (which apparently happens to be what pandas reports).</span></div>
<div class="gmail_extra" style><span style="font-family:arial,sans-serif"><br></span></div><div class="gmail_extra" style><span style="font-family:arial,sans-serif">This change would require recomputing the means for every pair-wise coefficient, using only the samples present in both rows. But since we already compute O(n^2) cross products, each over the full row length, the overall big-O complexity won't change. </span></div>
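<div class="gmail_extra" style>A rough sketch of what I have in mind, not a proposed patch: pairwise_corrcoef and its min_obs argument are hypothetical names, the masked entries of x_ma are filled with a placeholder 0, and requiring at least two common samples is my own assumption (with exactly two samples the coefficient is always -1 or +1, so a larger threshold may make more sense).</div>
<pre>
```python
import numpy as np

def pairwise_corrcoef(x, min_obs=2):
    """Hypothetical helper: row-by-row correlation with pair-wise deletion."""
    x = np.ma.asarray(x, dtype=float)
    data = np.ma.getdata(x)
    valid = ~np.ma.getmaskarray(x)
    n = x.shape[0]
    out = np.ma.masked_all((n, n))          # everything starts out masked
    for i in range(n):
        for j in range(i, n):
            # Columns observed in both row i and row j.
            both = np.logical_and(valid[i], valid[j])
            if both.sum() >= min_obs:
                # Means are recomputed per pair, on the common samples only.
                a = data[i, both] - data[i, both].mean()
                b = data[j, both] - data[j, both].mean()
                denom = np.sqrt((a * a).sum() * (b * b).sum())
                if denom > 0:               # skip constant rows
                    out[i, j] = out[j, i] = (a * b).sum() / denom
    return out

# The example matrix, rebuilt from the printout (0 under masked entries).
values = [[7, -4, -1, 0, -3, -2],
          [6, -3, 0, 4, 0, 5],
          [-4, 0, 7, 5, 0, 0],
          [0, 5, 0, 0, 1, 4]]
mask = [[0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 1],
        [1, 0, 1, 0, 0, 0]]
x_ma = np.ma.array(values, mask=mask)

result = pairwise_corrcoef(x_ma)
print(result[0, 3])
```
</pre>
<div class="gmail_extra" style>With this, result[0, 3] comes out as about -0.2402, the same as the two-row call, and it no longer depends on which other rows are in the matrix. Rows 2 and 3 share only one sample, so that entry stays masked.</div>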
<div class="gmail_extra" style><span style="font-family:arial,sans-serif"><br></span></div><div class="gmail_extra" style><font face="arial, sans-serif">And please don't remove this functionality. I will volunteer to fix it however we decide :) We can just clarify the behavior in the documentation.</font></div>
<div class="gmail_extra" style><span style="font-family:arial,sans-serif"><br></span></div><div class="gmail_extra" style><span style="font-family:arial,sans-serif">Thanks,</span></div><div class="gmail_extra" style><span style="font-family:arial,sans-serif"><br>
</span></div><div class="gmail_extra" style><span style="font-family:arial,sans-serif">Faraz</span></div><div class="gmail_extra" style><span style="font-family:arial,sans-serif"><br></span></div><div class="gmail_extra" style>
<span style="font-family:arial,sans-serif"><br></span></div></div>