[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef
josef.pktd at gmail.com
Wed Sep 25 23:05:49 EDT 2013
On Wed, Sep 25, 2013 at 8:26 PM, Faraz Mirzaei <fmmirzaei at gmail.com> wrote:
> Hi everyone,
>
> I'm using np.ma.corrcoef to compute the correlation coefficients among rows
> of a masked matrix, where the masked elements are the missing data. I've
> observed that in some cases, the np.ma.corrcoef gives invalid coefficients
> that are greater than 1 or less than -1.
>
> Here's an example:
>
> x = np.array([[ 7, -4, -1, -7, -3, -2],
>               [ 6, -3,  0,  4,  0,  5],
>               [-4, -5,  7,  5, -7, -7],
>               [-5,  5, -8,  0,  1,  4]])
>
> x_ma = np.ma.masked_less_equal(x , -5)
>
> C = np.round(np.ma.corrcoef(x_ma), 2)
>
> print C
>
> [[1.0 0.73 -- -1.68]
> [0.73 1.0 -0.86 -0.38]
> [-- -0.86 1.0 --]
> [-1.68 -0.38 -- 1.0]]
>
> As you can see, the [0,3] element is -1.68 which is not a valid correlation
> coefficient. (Valid correlation coefficients should be between -1 and 1).
>
> I looked at the code for np.ma.corrcoef, and this behavior seems to be due
> to the way that mean values of the rows of the input matrix are computed and
> subtracted from them. Apparently, the mean value is individually computed
> for each row, without masking the elements corresponding to the masked
> elements of the other row of the matrix, with respect to which the
> correlation coefficient is being computed.
>
> I guess the right way should be to recompute the mean value for each row
> every time that a correlation coefficient is being computed for two rows
> after propagating pair-wise masked values.
>
> Please let me know what you think.
Just general comments; I have no experience here.

From what you are saying, it sounds like np.ma is not doing pairwise
deletion in calculating the mean (which only requires ignoring
missings in one array), however it does (correctly) do pairwise
deletion in calculating the cross product.
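
A minimal sketch of the alternative suggested in the original message: apply pairwise deletion to the means as well, so that the means, variances, and cross product for a pair of rows all use the same observations. The helper name `pairwise_corr` is hypothetical and is not part of np.ma; this is an illustration, not a proposed patch.

```python
import numpy as np

def pairwise_corr(a, b):
    """Pearson r using only positions that are unmasked in BOTH rows.

    Hypothetical helper for illustration; not an np.ma function.
    """
    keep = ~(np.ma.getmaskarray(a) | np.ma.getmaskarray(b))
    if keep.sum() < 2:
        return np.nan  # fewer than two complete pairs: r is undefined
    xa = np.ma.getdata(a).astype(float)[keep]
    xb = np.ma.getdata(b).astype(float)[keep]
    # Means are recomputed on the jointly observed values only.
    da = xa - xa.mean()
    db = xb - xb.mean()
    denom = np.sqrt((da ** 2).sum() * (db ** 2).sum())
    return (da * db).sum() / denom if denom > 0 else np.nan

x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(x, -5)

r03 = pairwise_corr(x_ma[0], x_ma[3])
r02 = pairwise_corr(x_ma[0], x_ma[2])
r23 = pairwise_corr(x_ma[2], x_ma[3])
print(r03, r02, r23)
```

Because every quantity for a pair comes from the same subset, the Cauchy-Schwarz inequality applies and each coefficient stays within [-1, 1]. Rows 2 and 3 share only one unmasked position, so their coefficient is undefined.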

Covariance or correlation matrices with pairwise deletion are not
necessarily "proper" covariance or correlation matrices.
I've read that they don't need to be positive semi-definite, but I've
never heard of values outside of [-1, 1]. It might only be a problem
if you have a large fraction of missing values.

I think the current behavior in np.ma makes sense in that it uses all
the information available in estimating the mean, which should be more
accurate if we use more information. But it makes cov and corrcoef
even weirder than they already are with pairwise deletion.
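
To make the effect concrete, here is a sketch of a "mixed" estimator: each row is centered on the mean of all of its own unmasked values, while the cross product and the variances in the denominator use only the jointly unmasked positions. This is my reconstruction of a calculation that is consistent with the numbers in this thread, not a copy of the np.ma source, and `mixed_corr` is a hypothetical name.

```python
import numpy as np

def mixed_corr(a, b):
    """Correlation with own-row means but pairwise cross product/variances.

    Hypothetical reconstruction for illustration; assumes at least two
    jointly unmasked positions and non-constant data.
    """
    mask_a = np.ma.getmaskarray(a)
    mask_b = np.ma.getmaskarray(b)
    da_all = np.ma.getdata(a).astype(float)
    db_all = np.ma.getdata(b).astype(float)
    keep = ~(mask_a | mask_b)
    n = keep.sum()
    # Means use every unmasked value of each row separately.
    mean_a = da_all[~mask_a].mean()
    mean_b = db_all[~mask_b].mean()
    # Cross product uses only the jointly unmasked positions.
    cov = ((da_all[keep] - mean_a) * (db_all[keep] - mean_b)).sum() / (n - 1)
    # Variances are recomputed on the jointly unmasked positions.
    var_a = da_all[keep].var(ddof=1)
    var_b = db_all[keep].var(ddof=1)
    return cov / np.sqrt(var_a * var_b)

x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(x, -5)

m03 = mixed_corr(x_ma[0], x_ma[3])
m02 = mixed_corr(x_ma[0], x_ma[2])
print(round(m03, 6), round(m02, 6))
```

The numerator is no longer a centered cross product over the subset that defines the denominator, so nothing bounds the ratio by 1 in absolute value.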

Row-wise deletion (deleting observations that have at least one
missing), which would create "proper" correlation matrices, wouldn't
leave anything in your example: every observation (column) has at
least one masked value.
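
A quick check of how many observations survive row-wise (listwise) deletion in the example, assuming, as in the original message, that each column of `x` is one observation:

```python
import numpy as np

x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(x, -5)

# An observation (column) is complete if no variable (row) is masked in it.
mask = np.ma.getmaskarray(x_ma)
complete = ~mask.any(axis=0)
n_complete = int(complete.sum())

# Keep only the complete observations.
x_complete = x[:, complete]
print(n_complete, x_complete.shape)
```

Every column contains at least one value that is less than or equal to -5, so no complete observation remains and a listwise correlation matrix cannot be estimated from this data.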
I would check what R or other packages are doing and follow their
lead, or add another option.
(Similar: we had a case in statsmodels where I initially used all
information for calculating the mean, but then we dropped some
observations to match the behavior of Stata, and to use the same
observations for calculating the mean and the follow-up statistics.)
looks like pandas might be truncating the correlations to [-1, 1] (I
didn't check)
>>> import pandas as pd
>>> x_pd = pd.DataFrame(x_ma.T)
>>> x_pd.corr()
          0         1         2         3
0  1.000000  0.734367 -1.000000 -0.240192
1  0.734367  1.000000 -0.856565 -0.378777
2 -1.000000 -0.856565  1.000000       NaN
3 -0.240192 -0.378777       NaN  1.000000
>>> np.round(np.ma.corrcoef(x_ma), 6)
masked_array(data =
[[1.0 0.734367 -1.190909 -1.681346]
[0.734367 1.0 -0.856565 -0.378777]
[-1.190909 -0.856565 1.0 --]
[-1.681346 -0.378777 -- 1.0]],
mask =
[[False False False False]
[False False False False]
[False False False True]
[False False True False]],
fill_value = 1e+20)
Josef
>
> Thanks,
>
> Faraz