[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef

josef.pktd at gmail.com josef.pktd at gmail.com
Wed Sep 25 23:19:30 EDT 2013


On Wed, Sep 25, 2013 at 11:05 PM,  <josef.pktd at gmail.com> wrote:
> On Wed, Sep 25, 2013 at 8:26 PM, Faraz Mirzaei <fmmirzaei at gmail.com> wrote:
>> Hi everyone,
>>
>> I'm using np.ma.corrcoef to compute the correlation coefficients among rows
>> of a masked matrix, where the masked elements are the missing data. I've
>> observed that in some cases, the np.ma.corrcoef gives invalid coefficients
>> that are greater than 1 or less than -1.
>>
>> Here's an example:
>>
>> import numpy as np
>>
>> x = np.array([[ 7, -4, -1, -7, -3, -2],
>>               [ 6, -3,  0,  4,  0,  5],
>>               [-4, -5,  7,  5, -7, -7],
>>               [-5,  5, -8,  0,  1,  4]])
>>
>> x_ma = np.ma.masked_less_equal(x, -5)
>>
>> C = np.round(np.ma.corrcoef(x_ma), 2)
>>
>> print(C)
>>
>> [[ 1.0    0.73   --    -1.68]
>>  [ 0.73   1.0   -0.86  -0.38]
>>  [ --    -0.86   1.0    --  ]
>>  [-1.68  -0.38   --     1.0 ]]
>>
>> As you can see, the [0,3] element is -1.68 which is not a valid correlation
>> coefficient. (Valid correlation coefficients should be between -1 and 1).
>>
>> I looked at the code for np.ma.corrcoef, and this behavior seems to be due
>> to the way that mean values of the rows of the input matrix are computed and
>> subtracted from them. Apparently, the mean value is individually computed
>> for each row, without masking the elements corresponding to the masked
>> elements of the other row of the matrix, with respect to which the
>> correlation coefficient is being computed.
>>
>> I guess the right way should be to recompute the mean value for each row
>> every time that a correlation coefficient is being computed for two rows
>> after propagating pair-wise masked values.
>>
>> Please let me know what you think.
>
> just general comments, I have no experience here
>
> From what you are saying it sounds like np.ma is not doing pairwise
> deletion in calculating the mean (which only requires ignoring
> missings in one array), however it does (correctly) do pairwise
> deletion in calculating the cross product.

Actually, I think the way the mean is calculated is not what matters
here: without clipping, pairwise deletion can produce weird correlation
coefficients either way.

With pairwise deletion you use different samples, subsets of the data,
for the variances and the covariances.
It should be easy (?) to construct examples where the pairwise
deletion for the covariance produces a large positive or negative
number, and both variances and standard deviations are small, using
two different subsamples.
Once you calculate the correlation coefficient, it could be all over
the place, independent of the mean calculations.
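
A hand computation of the [0, 3] entry from Faraz's example, as a
rough sketch rather than a statement of what np.ma.corrcoef does
internally. The assumption (mine, labeled in the code as well) is that
the cross product is centered with each row's own mean over all of its
observed values, while the variances are recomputed on the jointly
observed columns only. The helper name pair_corr is made up for this
illustration.

```python
import numpy as np

x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]], dtype=float)
valid = x > -5  # True where a value is observed (same rule as masked_less_equal(x, -5))

def pair_corr(i, j, own_means):
    """Correlation of rows i and j on their jointly observed columns.

    own_means=True  : ASSUMPTION, center the cross product with each row's
                      own mean (taken over all of that row's observed values).
    own_means=False : center with the means of the jointly observed columns.
    The variances always use the jointly observed columns.
    """
    both = valid[i] & valid[j]
    a, b = x[i, both], x[j, both]
    n = both.sum()
    if own_means:
        mean_a, mean_b = x[i, valid[i]].mean(), x[j, valid[j]].mean()
    else:
        mean_a, mean_b = a.mean(), b.mean()
    cov = ((a - mean_a) * (b - mean_b)).sum() / (n - 1)
    return float(cov / np.sqrt(a.var(ddof=1) * b.var(ddof=1)))

mixed = pair_corr(0, 3, own_means=True)        # statistics from different subsamples
consistent = pair_corr(0, 3, own_means=False)  # everything from the same subsample
print(round(mixed, 6), round(consistent, 6))
```

Under that assumption the mixed version agrees with the -1.681346
reported for np.ma.corrcoef, and the consistent version agrees with the
-0.240192 from pandas. When every piece comes from the same subsample,
the Cauchy-Schwarz inequality keeps the ratio inside [-1, 1]; once the
pieces come from different subsamples nothing guarantees that.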

conclusion: pairwise deletion requires post-processing if you want a
proper correlation matrix.
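
One simple kind of post-processing, as a sketch only: clip the negative
eigenvalues and rescale back to a unit diagonal. This is not what np.ma
or pandas do, and it is not the nearest correlation matrix in any
optimal sense. The function name to_proper_correlation and the eps
parameter are hypothetical, and the input is assumed to have no masked
or NaN entries. The numbers are the rows 0, 1 and 3 of the
np.ma.corrcoef output quoted in this thread.

```python
import numpy as np

def to_proper_correlation(c, eps=1e-8):
    """Return a symmetric, positive semi-definite matrix with unit diagonal.

    Hypothetical helper: eigenvalue clipping followed by rescaling.
    Assumes c is square and contains no NaN or masked entries.
    """
    c = np.asarray(c, dtype=float)
    c = (c + c.T) / 2.0                # enforce exact symmetry
    w, v = np.linalg.eigh(c)           # eigenvalues w, eigenvectors in columns of v
    w = np.clip(w, eps, None)          # remove negative eigenvalues
    psd = (v * w) @ v.T                # rebuild V diag(w) V^T
    d = np.sqrt(np.diag(psd))
    out = psd / np.outer(d, d)         # rescale so the diagonal is 1
    out = np.clip(out, -1.0, 1.0)      # guard against rounding error only
    np.fill_diagonal(out, 1.0)
    return out

# Rows/columns 0, 1, 3 of the pairwise-deletion result quoted in the thread.
c = np.array([[ 1.0,       0.734367, -1.681346],
              [ 0.734367,  1.0,      -0.378777],
              [-1.681346, -0.378777,  1.0]])
fixed = to_proper_correlation(c)
print(np.round(fixed, 3))
```

Simple clipping of the individual entries to [-1, 1] would fix the
range, but the result could still fail to be positive semi-definite,
which is why the sketch works on the eigenvalues instead.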

Josef

>
> covariance or correlation matrices with pairwise deletion are not
> necessarily "proper" covariance or correlation matrices.
> I've read that they don't need to be positive semi-definite, but I've
> never heard of values outside of [-1, 1]. It might only be a problem
> if you have a large fraction of missing values.
>
> I think the current behavior in np.ma makes sense in that it uses all
> the information available in estimating the mean, which should be more
> accurate if we use more information. But it makes cov and corrcoef
> even weirder than they already are with pairwise deletion.
>
> Row-wise deletion (deleting observations that have at least one
> missing value), which would create "proper" correlation matrices,
> wouldn't leave much data in your example.
>
> I would check what R or other packages are doing and follow their
> lead, or add another option.
> (Similar: we had a case in statsmodels where I initially used all the
> information for calculating the mean, but then we dropped some
> observations to match the behavior of Stata, and to use the same
> observations for calculating the mean and the follow-up statistics.)
>
>
> looks like pandas might be truncating the correlations to [-1, 1] (I
> didn't check)
>
> >>> import pandas as pd
> >>> x_pd = pd.DataFrame(x_ma.T)
> >>> x_pd.corr()
>           0         1         2         3
> 0  1.000000  0.734367 -1.000000 -0.240192
> 1  0.734367  1.000000 -0.856565 -0.378777
> 2 -1.000000 -0.856565  1.000000       NaN
> 3 -0.240192 -0.378777       NaN  1.000000
>
> >>> np.round(np.ma.corrcoef(x_ma), 6)
> masked_array(data =
>  [[1.0 0.734367 -1.190909 -1.681346]
>  [0.734367 1.0 -0.856565 -0.378777]
>  [-1.190909 -0.856565 1.0 --]
>  [-1.681346 -0.378777 -- 1.0]],
>              mask =
>  [[False False False False]
>  [False False False False]
>  [False False False  True]
>  [False False  True False]],
>        fill_value = 1e+20)
>
>
> Josef
>
>
>>
>> Thanks,
>>
>> Faraz
>>


