Re: [Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef

26 Sep 2013


      On Thu, Sep 26, 2013 at 6:42 PM, Nathaniel Smith  wrote:
...
On 26 Sep 2013 21:59, "Faraz Mirzaei"  wrote:
...
Thanks Josef and Nathaniel for your responses.
In the application that I have, I don't use the correlation coefficient
matrix as a whole (so I don't care if it is PSD or not). I simply read the
off-diagonal elements for pair-wise correlation coefficients. I use the
pairwise correlation coefficient to test if the data from various sources
(i.e., rows of the matrix) agree with each other when present.
Right now, I use ma.corrcoef(x[i, :], x[j, :]) and read the
off-diagonal element in a loop over i and j. It is just a bit uglier than
calling ma.corrcoef(x).
At least for my application, truncation to -1 or +1 (or scaling such that
the largest value becomes 1, etc.) is completely wrong, since it would imply
that the two sources completely agree with each other (factoring out a minus
sign), which may not be the case. For example, consider the first and last rows
of the example I provided:
...
...
...
print x_ma
[[ 7 -4 -1 -- -3 -2]
 [ 6 -3  0  4  0  5]
 [-4 --  7  5 -- --]
 [--  5 --  0  1  4]]
...
...
...
np.ma.corrcoef(x_ma)[0,3]
-1.6813456149534147
On the other hand, if we supply only the first and last rows to the
function, we get:
...
...
...
np.ma.corrcoef(x_ma[0,:], x_ma[3,:])
masked_array(data =
 [[1.0 -0.240192230708]
 [-0.240192230708 1.0]],
             mask =
 [[False False]
 [False False]],
       fill_value = 1e+20)
Interestingly, this is the same as what pandas returns as the [3,0]
element of the correlation coefficient matrix, and it is equal to the
pairwise-deletion result:
...
...
...
np.corrcoef([-4, -3, -2], [5, 1, 4])  # Note that this is NOT np.ma.corrcoef
array([[ 1.        , -0.24019223],
       [-0.24019223,  1.        ]])
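The pairwise-deletion step above can be sketched as a small helper that keeps only the columns valid in both rows and hands them to plain np.corrcoef. The helper name pairwise_corr is hypothetical, and the values stored under the masked entries are not shown in this thread, so 0 is used as a placeholder there (an assumption; those entries are never read).

```python
import numpy as np

# Values printed for x_ma in this thread. Masked positions hold a
# placeholder 0 (assumption): the underlying data is not shown.
data = np.array([[7, -4, -1, 0, -3, -2],
                 [6, -3,  0, 4,  0,  5],
                 [-4, 0,  7, 5,  0,  0],
                 [0,  5,  0, 0,  1,  4]], dtype=float)
mask = np.array([[0, 0, 0, 1, 0, 0],
                 [0, 0, 0, 0, 0, 0],
                 [0, 1, 0, 0, 1, 1],
                 [1, 0, 1, 0, 0, 0]], dtype=bool)
x_ma = np.ma.array(data, mask=mask)


def pairwise_corr(a, b):
    """Pearson r of two 1-D masked arrays over entries valid in both."""
    valid = ~(np.ma.getmaskarray(a) | np.ma.getmaskarray(b))
    # Plain np.corrcoef on the shared, unmasked samples only.
    return np.corrcoef(np.ma.getdata(a)[valid], np.ma.getdata(b)[valid])[0, 1]


r = pairwise_corr(x_ma[0], x_ma[3])
print(r)  # about -0.2402, matching the value quoted above
```

For rows 0 and 3 the shared columns are 1, 4 and 5, which gives exactly the samples [-4, -3, -2] and [5, 1, 4] used in the np.corrcoef call above.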
Also, I don't know why the ma.corrcoef results Josef has mentioned are
different than mine. In particular, Josef reports element [2, 0] of the
ma.corrcoef result to be -1.19, but I get -- (i.e., missing and masked,
probably due to too few samples available). Josef: are you sure that you
have entered the example values correctly into python? Along the same lines,
the results that Nathaniel has posted from R are different, since the input
is not a masked matrix I guess (please note that in the original example, I
had masked values less than or equal to -5).
Yes, sorry, this is just a cut and paste error - in fact the result I posted
is what R gives for the data with values <= -5 replaced by NA, but I left
this line out of the email.
I think the only difference is that R and pandas give a correlation of 1.0
when there are only 1 or 2 data points, and ma.corrcoef returns masked in
this case. Not sure which makes more sense.
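As a small illustration of why the two-sample case is degenerate: two points always lie on a straight line, so the Pearson coefficient has magnitude 1 whatever the data (unless one series is constant). The sketch below uses the two columns that rows 0 and 2 of x_ma share (columns 0 and 2); it only calls plain np.corrcoef and makes no claim about what R, pandas or ma.corrcoef do internally.

```python
import numpy as np

# Rows 0 and 2 of x_ma overlap only in columns 0 and 2, leaving two
# samples per row: (7, -1) from row 0 and (-4, 7) from row 2.
r = np.corrcoef([7, -1], [-4, 7])[0, 1]
print(r)  # magnitude 1 up to rounding, here with a negative sign
```

Whether such a value should be reported or masked is a policy choice; masking it avoids suggesting perfect agreement based on two samples.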
...
In any case, I think the correlation coefficient between two rows of a
matrix should not depend on what other rows are supplied. In other words,
np.ma.corrcoef(x_ma)[0,3] should be equal to np.ma.corrcoef(x_ma[0,:],
x_ma[3,:])[0,1] (which apparently happens to be what pandas reports).
This change would require recomputing the mean for every pair-wise
coefficient calculation, but since we are computing cross products O(n^2)
times, the overall big-O complexity won't change.
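A minimal sketch of that proposal, not the actual NumPy or pandas code: loop over all row pairs, restrict each pair to the columns valid in both rows, and recompute the mean, cross product and standard deviation on that overlap. The function name corrcoef_pairwise and the min_samples threshold are assumptions for illustration, as are the placeholder zeros stored under the masked entries.

```python
import numpy as np


def corrcoef_pairwise(x, min_samples=3):
    """Row-wise Pearson correlation of a 2-D masked array, pairwise deletion.

    Pairs sharing fewer than min_samples valid columns stay masked.
    """
    data = np.ma.getdata(x).astype(float)
    valid = ~np.ma.getmaskarray(x)
    n = data.shape[0]
    out = np.ma.masked_all((n, n), dtype=float)
    for i in range(n):
        for j in range(i, n):
            both = valid[i] & valid[j]
            if both.sum() < min_samples:
                continue
            # Means are recomputed on the shared entries of this pair only.
            di = data[i, both] - data[i, both].mean()
            dj = data[j, both] - data[j, both].mean()
            denom = np.sqrt((di @ di) * (dj @ dj))
            if denom == 0:
                continue  # constant row on the overlap: r is undefined
            out[i, j] = out[j, i] = (di @ dj) / denom
    return out


# x_ma from this thread; masked positions hold a placeholder 0 (assumption).
data = np.array([[7, -4, -1, 0, -3, -2],
                 [6, -3,  0, 4,  0,  5],
                 [-4, 0,  7, 5,  0,  0],
                 [0,  5,  0, 0,  1,  4]], dtype=float)
mask = np.array([[0, 0, 0, 1, 0, 0],
                 [0, 0, 0, 0, 0, 0],
                 [0, 1, 0, 0, 1, 1],
                 [1, 0, 1, 0, 0, 0]], dtype=bool)
x_ma = np.ma.array(data, mask=mask)

c = corrcoef_pairwise(x_ma)
print(c[0, 3])  # about -0.2402, independent of the other rows
```

With this approach every unmasked entry lies in [-1, 1], because each coefficient is an ordinary Pearson correlation computed on a single consistent set of samples.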
And please don't remove this functionality. I will volunteer to fix it
however we decide :) We can just clarify the behavior in the documentation.
In the long run I prefer R's behaviour of requiring the user to specify
before skipping anything, but I tend to agree that in the short term
pairwise deletion is what ma.corrcoef users expect and what we should do.
Maybe you could implement the fix and we could move the discussion to the
PR?
pandas has a cython function in algos that loops over all pairs and
calculates mean, cross product and standard deviation for each pair
separately.

I agree that that would be the best choice for pairwise deletion for
np.ma.corrcoef, and cov

Josef
...
-n
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


josef.pktd＠gmail.com