[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Sep 26 06:51:34 EDT 2013


On Thu, Sep 26, 2013 at 4:21 AM, Nathaniel Smith <njs at pobox.com> wrote:
> If you want a proper self-consistent correlation/covariance matrix, then
> pairwise deletion just makes no sense period, I don't see how postprocessing
> can help.

clipping to [-1, 1] and finding the nearest positive semi-definite matrix.
For the latter there is some code in statsmodels, and several newer
algorithms that I haven't looked at.

It's a quite common problem in finance, but usually longer time series
with not a large number of missing values.

>
> If you want a matrix of correlations, then pairwise deletion makes sense.
> It's an interesting point that arguably the current ma.corrcoef code may
> actually give you a better estimator of the individual correlation
> coefficients than doing full pairwise deletion, but it's pretty surprising
> and unexpected... when people call corrcoef I think they are asking "please
> compute the textbook formula for 'sample correlation'" not "please give me
> some arbitrary good estimator for the population correlation", so we
> probably have to change it.
>
> (Hopefully no-one has published anything based on the current code.)

I haven't seen a textbook version of this yet.

Calculating every mean (n + 1) * n / 2 times sounds a bit excessive,
especially if it doesn't really solve the problem.

Josef

>
> -n
>
> On 26 Sep 2013 04:19, <josef.pktd at gmail.com> wrote:
>>
>> On Wed, Sep 25, 2013 at 11:05 PM,  <josef.pktd at gmail.com> wrote:
>> > On Wed, Sep 25, 2013 at 8:26 PM, Faraz Mirzaei <fmmirzaei at gmail.com>
>> > wrote:
>> >> Hi everyone,
>> >>
>> >> I'm using np.ma.corrcoef to compute the correlation coefficients among
>> >> rows
>> >> of a masked matrix, where the masked elements are the missing data.
>> >> I've
>> >> observed that in some cases, the np.ma.corrcoef gives invalid
>> >> coefficients
>> >> that are greater than 1 or less than -1.
>> >>
>> >> Here's an example:
>> >>
>> >> x = array([[ 7, -4, -1, -7, -3, -2],
>> >>        [ 6, -3,  0,  4,  0,  5],
>> >>        [-4, -5,  7,  5, -7, -7],
>> >>        [-5,  5, -8,  0,  1,  4]])
>> >>
>> >> x_ma = np.ma.masked_less_equal(x , -5)
>> >>
>> >> C = np.round(np.ma.corrcoef(x_ma), 2)
>> >>
>> >> print C
>> >>
>> >> [[1.0    0.73    --     -1.68]
>> >>  [0.73  1.0     -0.86 -0.38]
>> >>  [--      -0.86   1.0   --]
>> >>  [-1.68 -0.38   --     1.0]]
>> >>
>> >> As you can see, the [0,3] element is -1.68 which is not a valid
>> >> correlation
>> >> coefficient. (Valid correlation coefficients should be between -1 and
>> >> 1).
>> >>
>> >> I looked at the code for np.ma.corrcoef, and this behavior seems to be
>> >> due
>> >> to the way that mean values of the rows of the input matrix are
>> >> computed and
>> >> subtracted from them. Apparently, the mean value is individually
>> >> computed
>> >> for each row, without masking the elements corresponding to the masked
>> >> elements of the other row of the matrix, with respect to which the
>> >> correlation coefficient is being computed.
>> >>
>> >> I guess the right way should be to recompute the mean value for each
>> >> row
>> >> every time that a correlation coefficient is being computed for two
>> >> rows
>> >> after propagating pair-wise masked values.
>> >>
>> >> Please let me know what you think.
>> >
>> > just general comments, I have no experience here
>> >
>> > From what you are saying it sounds like np.ma is not doing pairwise
>> > deletion in calculating the mean (which only requires ignoring
>> > missings in one array), however it does (correctly) do pairwise
>> > deletion in calculating the cross product.
>>
>> Actually, I think the calculation of the mean is not relevant for
>> having weird correlation coefficients without clipping.
>>
>> With pairwise deletion you use different samples, subsets of the data,
>> for the variances and the covariances.
>> It should be easy (?) to construct examples where the pairwise
>> deletion for the covariance produces a large positive or negative
>> number, and both variances and standard deviations are small, using
>> two different subsamples.
>> Once you calculate the correlation coefficient, it could be all over
>> the place, independent of the mean calculations.
>>
>> conclusion: pairwise deletion requires post-processing if you want a
>> proper correlation matrix.
>>
>> Josef
>>
>> >
>> > covariance or correlation matrices with pairwise deletion are not
>> > necessarily "proper" covariance or correlation matrices.
>> > I've read that they don't need to be positive semi-definite, but I've
>> > never heard of values outside of [-1, 1]. It might only be a problem
>> > if you have a large fraction of missing values..
>> >
>> > I think the current behavior in np.ma makes sense in that it uses all
>> > the information available in estimating the mean, which should be more
>> > accurate if we use more information. But it makes cov and corrcoef
>> > even weirder than they already are with pairwise deletion.
>> >
>> > Row-wise deletion (deleting observations that have at least one
>> > missing), which would create "proper" correlation matrices, wouldn't
>> > produce much in your example.
>> >
>> > I would check what R or other packages are doing and follow their
>> > lead, or add another option.
>> > (similar: we had a case in statsmodels where I used initially all
>> > information for calculating the mean, but then we dropped some
>> > observations to match the behavior of Stata, and to use the same
>> > observations for calculating the mean and the follow up statistics.)
>> >
>> >
>> > looks like pandas might be truncating the correlations to [-1, 1] (I
>> > didn't check)
>> >
>> >>>> import pandas as pd
>> >>>> x_pd = pd.DataFrame(x_ma.T)
>> >>>> x_pd.corr()
>> >           0         1         2         3
>> > 0  1.000000  0.734367 -1.000000 -0.240192
>> > 1  0.734367  1.000000 -0.856565 -0.378777
>> > 2 -1.000000 -0.856565  1.000000       NaN
>> > 3 -0.240192 -0.378777       NaN  1.000000
>> >
>> >>>> np.round(np.ma.corrcoef(x_ma), 6)
>> > masked_array(data =
>> >  [[1.0 0.734367 -1.190909 -1.681346]
>> >  [0.734367 1.0 -0.856565 -0.378777]
>> >  [-1.190909 -0.856565 1.0 --]
>> >  [-1.681346 -0.378777 -- 1.0]],
>> >              mask =
>> >  [[False False False False]
>> >  [False False False False]
>> >  [False False False  True]
>> >  [False False  True False]],
>> >        fill_value = 1e+20)
>> >
>> >
>> > Josef
>> >
>> >
>> >>
>> >> Thanks,
>> >>
>> >> Faraz
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> NumPy-Discussion mailing list
>> >> NumPy-Discussion at scipy.org
>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>



More information about the NumPy-Discussion mailing list