[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef

Thu Sep 26 19:01:14 EDT 2013

On Thu, Sep 26, 2013 at 6:42 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On 26 Sep 2013 21:59, "Faraz Mirzaei" <fmmirzaei at gmail.com> wrote:
>>
>> Thanks Josef and Nathaniel for your responses.
>>
>> In the application that I have, I don't use the correlation coefficient
>> matrix as a whole (so I don't care if it is PSD or not). I simply read the
>> off-diagonal elements for pair-wise correlation coefficients. I use the
>> pairwise correlation coefficient to test if the data from various sources
>> (i.e., rows of the matrix), agree with each other when present.
>>
>> Right now, I use, ma.corrcoef( x[ i, :] , x[ j, :]) and read the
>> off-diagonal element in a loop over i and j. It is just a bit uglier than
>> calling ma.corrcoef(x).
>>
>> At least for my application, truncation to -1 or +1 (or scaling such that
>> the largest value becomes 1, etc.) is completely wrong, since it would imply
>> that the two sources completely agree with each other (factoring out a minus
>> sign), which may not be the case. For example, consider the first and last
>> rows of the example I provided:
>>
>> >>> print x_ma
>> [[ 7 -4  -1  --  -3  -2]
>>  [ 6 -3   0  4   0   5]
>>  [-4  --   7  5  --   --]
>>  [--   5   --  0   1  4]]
>>
>> >>> np.ma.corrcoef(x_ma)[0,3]
>> -1.6813456149534147
>>
>>
>> On the other hand, if we supply only the first and last rows to the
>> function, we get:
>>
>> >>> np.ma.corrcoef(x_ma[0,:], x_ma[3,:])
>> masked_array(data =
>>  [[1.0 -0.240192230708]
>>  [-0.240192230708 1.0]],
>>              mask =
>>  [[False False]
>>  [False False]],
>>        fill_value = 1e+20)
>>
>> Interestingly, this is the same as what pandas returns as the [3,0]
>> element of the correlation coefficient matrix, and it is equal to the
>> pairwise-deletion result:
>>
>> >>> np.corrcoef([-4, -3, -2], [5, 1, 4])  # Note: this is NOT np.ma.corrcoef
>> array([[ 1.        , -0.24019223],
>>        [-0.24019223,  1.        ]])
>>
>>
>> Also, I don't know why the ma.corrcoef results Josef has mentioned are
>> different than mine. In particular, Josef reports element [2, 0] of the
>> ma.corrcoef result to be -1.19, but I get -- (i.e., missing and masked,
>> probably due to too few samples available). Josef: are you sure that you
>> have entered the example values correctly into python? Along the same lines,
>> the results that Nathaniel has posted from R are different, since the input
>> is not a masked matrix I guess (please note that in the original example, I
>> had masked values less than or equal to -5).
>
> Yes, sorry, this is just a cut and paste error - in fact the result I posted
> is what R gives for the data with values <= -5 replaced by NA, but I left
> this line out of the email.
>
> I think the only difference is that R and pandas give a correlation of 1.0
> when there are only 1 or 2 data points, and ma.corrcoef returns masked in
> this case. Not sure which makes more sense.
>
>>
>> In any case, I think the correlation coefficient between two rows of a
>> matrix should not depend on what other rows are supplied. In other words,
>> np.ma.corrcoef(x_ma)[0,3] should be equal to np.ma.corrcoef(x_ma[0,:],
>> x_ma[3,:])[0,1] (which apparently happens to be what pandas reports).
>>
>> This change would require recomputing the mean for every pair-wise
>> coefficient calculation, but since we are already computing cross products
>> O(n^2) times, the overall big-O complexity won't change.
>>
>> And please don't remove this functionality. I will volunteer to fix it
>> however we decide :) We can just clarify the behavior in the documentation.
>
> In the long run I prefer R's behaviour of requiring the user to specify
> before skipping anything, but I tend to agree that in the short term
> pairwise deletion is what ma.corrcoef users expect and what we should do.
> Maybe you could implement the fix and we could move the discussion to the
> PR?

pandas has a cython function in algos that loops over all pairs and
calculates mean, cross product and standard deviation for each pair
separately.
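
As a rough illustration of that per-pair approach, here is a minimal pure-NumPy
sketch. It is not the pandas Cython code and not a proposed patch; the function
name pairwise_corrcoef and the min_obs parameter are hypothetical, and the -9
placeholders only stand in for the (unknown) original values that were masked
for being <= -5.

```python
import numpy as np


def pairwise_corrcoef(x, min_obs=2):
    """Row-by-row correlation of a masked 2-D array with pairwise deletion.

    Hypothetical helper for illustration only; not part of NumPy or pandas.
    Pairs with fewer than `min_obs` jointly observed columns, or with zero
    variance on the joint columns, are left masked in the result.
    """
    data = np.ma.getdata(x).astype(float)
    valid = ~np.ma.getmaskarray(x)
    n_rows = data.shape[0]
    out = np.ma.masked_all((n_rows, n_rows), dtype=float)
    for i in range(n_rows):
        for j in range(i, n_rows):
            both = valid[i] & valid[j]      # columns observed in both rows
            if both.sum() < min_obs:
                continue
            a = data[i, both]
            b = data[j, both]
            da = a - a.mean()               # mean recomputed for this pair
            db = b - b.mean()
            denom = np.sqrt((da * da).sum() * (db * db).sum())
            if denom == 0:
                continue
            out[i, j] = out[j, i] = (da * db).sum() / denom
    return out


# Example data from this thread; -9 is an assumed placeholder for the
# entries that were masked (values <= -5).
raw = np.array([[ 7, -4, -1, -9, -3, -2],
                [ 6, -3,  0,  4,  0,  5],
                [-4, -9,  7,  5, -9, -9],
                [-9,  5, -9,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(raw, -5)
r = pairwise_corrcoef(x_ma)
print(r[0, 3])
```

For rows 0 and 3 only three columns are jointly observed, so r[0, 3] is the
same three-point computation as np.corrcoef([-4, -3, -2], [5, 1, 4])[0, 1]
quoted above. Whether pairs with only two joint observations should be
reported or masked is the open question raised earlier; min_obs is there
just to make that choice explicit.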

I agree that this would be the best choice for pairwise deletion in
np.ma.corrcoef, and in cov.
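
For cov, the same pairwise-deletion idea can also be written with matrix
products instead of an explicit loop. The sketch below is only one possible
formulation under my own assumptions: pairwise_cov and its ddof argument are
hypothetical names, the -9 entries are placeholders for the masked values, and
the sum-of-products formula is less numerically stable than subtracting the
per-pair means first.

```python
import numpy as np


def pairwise_cov(x, ddof=1):
    """Row-by-row covariance of a masked 2-D array with pairwise deletion.

    Hypothetical helper for illustration only. Pairs with no more than
    `ddof` jointly observed columns are masked in the result.
    """
    valid_bool = ~np.ma.getmaskarray(x)
    valid = valid_bool.astype(float)
    # Zero out masked entries so they drop out of every sum below.
    x0 = np.where(valid_bool, np.ma.getdata(x), 0.0).astype(float)
    n = valid @ valid.T      # n[i, j]: number of jointly observed columns
    sums = x0 @ valid.T      # sums[i, j]: sum of row i over those columns
    cross = x0 @ x0.T        # cross[i, j]: sum of x_i * x_j over those columns
    with np.errstate(divide="ignore", invalid="ignore"):
        cov = (cross - sums * sums.T / n) / (n - ddof)
    return np.ma.masked_where(n <= ddof, cov)


# Same example data as in the thread; -9 is an assumed placeholder for the
# entries that were masked (values <= -5).
raw = np.array([[ 7, -4, -1, -9, -3, -2],
                [ 6, -3,  0,  4,  0,  5],
                [-4, -9,  7,  5, -9, -9],
                [-9,  5, -9,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(raw, -5)
c = pairwise_cov(x_ma)
print(c[0, 3])
```

Here too each entry depends only on the two rows involved, so c[0, 3] does not
change when other rows are added or removed.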

Josef

>
> -n