[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef

Thu Sep 26 06:56:21 EDT 2013

On Thu, Sep 26, 2013 at 6:51 AM,  <josef.pktd at gmail.com> wrote:
> On Thu, Sep 26, 2013 at 4:21 AM, Nathaniel Smith <njs at pobox.com> wrote:
>> If you want a proper self-consistent correlation/covariance matrix, then
>> pairwise deletion just makes no sense period, I don't see how postprocessing
>> can help.
>
> clipping to [-1, 1] and finding the nearest positive semi-definite matrix.
> For the latter there is some code in statsmodels, and several newer
> algorithms that I haven't looked at.
>
> It's quite a common problem in finance, but usually with longer time series
> and not a large number of missing values.
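
A minimal sketch of that kind of post-processing, assuming a square matrix with no missing entries. Note that this is plain eigenvalue clipping followed by rescaling to a unit diagonal. It is not the statsmodels code and not a true nearest-correlation-matrix algorithm, and the function name `clip_and_repair` is made up for illustration.

```python
import numpy as np

def clip_and_repair(corr):
    """Clip entries to [-1, 1], then force positive semi-definiteness.

    Hypothetical helper: simple eigenvalue clipping, not a
    nearest-matrix solver. Assumes no NaN or masked entries.
    """
    c = np.clip(np.asarray(corr, dtype=float), -1.0, 1.0)
    c = (c + c.T) / 2.0                      # enforce exact symmetry
    w, v = np.linalg.eigh(c)                 # eigen-decomposition
    w = np.clip(w, 0.0, None)                # drop negative eigenvalues
    fixed = (v * w) @ v.T                    # rebuild the matrix
    d = np.sqrt(np.diag(fixed))
    fixed = fixed / np.outer(d, d)           # rescale to unit diagonal
    np.fill_diagonal(fixed, 1.0)
    return fixed

# A symmetric matrix with unit diagonal that is not positive semi-definite.
bad = np.array([[ 1.0, 0.9, -0.9],
                [ 0.9, 1.0,  0.9],
                [-0.9, 0.9,  1.0]])
fixed = clip_and_repair(bad)
print(np.linalg.eigvalsh(bad).min(), np.linalg.eigvalsh(fixed).min())
```

The result is a valid correlation matrix, but in general not the closest one to the input in any particular norm.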
>
>>
>> If you want a matrix of correlations, then pairwise deletion makes sense.
>> It's an interesting point that arguably the current ma.corrcoef code may
>> actually give you a better estimator of the individual correlation
>> coefficients than doing full pairwise deletion, but it's pretty surprising
>> and unexpected... when people call corrcoef I think they are asking "please
>> compute the textbook formula for 'sample correlation'" not "please give me
>> some arbitrary good estimator for the population correlation", so we
>> probably have to change it.
>>
>> (Hopefully no-one has published anything based on the current code.)
>
> I haven't seen a textbook version of this yet.
>
> Calculating every mean (n + 1) * n / 2 times sounds a bit excessive,
> especially if it doesn't really solve the problem.

Unless you also calculate each standard deviation (n + 1) * n / 2 times.
But then you lose the relationship between cov and corrcoef.
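
A rough sketch of what that full pairwise version might look like, where the means and the standard deviations are both recomputed on the columns that are unmasked in both rows of each pair. The function name `pairwise_corrcoef` is made up for illustration. This is not what np.ma.corrcoef currently does.

```python
import numpy as np

def pairwise_corrcoef(x_ma):
    """Row-by-row correlation with full pairwise deletion.

    Hypothetical helper: for each pair of rows, the means, variances
    and covariance all come from the same shared subset of columns.
    Pairs with fewer than two shared columns, or with zero variance,
    are returned as NaN.
    """
    data = np.ma.getdata(x_ma).astype(float)
    valid = ~np.ma.getmaskarray(x_ma)
    n = data.shape[0]
    out = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            both = valid[i] & valid[j]       # columns observed in both rows
            if both.sum() < 2:
                continue
            a = data[i, both] - data[i, both].mean()
            b = data[j, both] - data[j, both].mean()
            denom = np.sqrt((a @ a) * (b @ b))
            if denom == 0:
                continue
            out[i, j] = out[j, i] = (a @ b) / denom
    return out

# The example data from the original report.
x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(x, -5)
r = pairwise_corrcoef(x_ma)
print(np.round(r, 6))
```

Each entry is then an ordinary sample correlation on its own subsample, so it stays within [-1, 1], but the result is no longer the covariance matrix scaled by one fixed set of standard deviations, and it still does not have to be positive semi-definite.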

Josef

>
> Josef
>
>>
>> -n
>>
>> On 26 Sep 2013 04:19, <josef.pktd at gmail.com> wrote:
>>>
>>> On Wed, Sep 25, 2013 at 11:05 PM,  <josef.pktd at gmail.com> wrote:
>>> > On Wed, Sep 25, 2013 at 8:26 PM, Faraz Mirzaei <fmmirzaei at gmail.com> wrote:
>>> >> Hi everyone,
>>> >>
>>> >> I'm using np.ma.corrcoef to compute the correlation coefficients among
>>> >> rows of a masked matrix, where the masked elements are the missing data.
>>> >> I've observed that in some cases, np.ma.corrcoef gives invalid
>>> >> coefficients that are greater than 1 or less than -1.
>>> >>
>>> >> Here's an example:
>>> >>
>>> >> import numpy as np
>>> >>
>>> >> x = np.array([[ 7, -4, -1, -7, -3, -2],
>>> >>               [ 6, -3,  0,  4,  0,  5],
>>> >>               [-4, -5,  7,  5, -7, -7],
>>> >>               [-5,  5, -8,  0,  1,  4]])
>>> >>
>>> >> x_ma = np.ma.masked_less_equal(x, -5)
>>> >>
>>> >> C = np.round(np.ma.corrcoef(x_ma), 2)
>>> >>
>>> >> print(C)
>>> >>
>>> >> [[1.0    0.73    --     -1.68]
>>> >>  [0.73  1.0     -0.86 -0.38]
>>> >>  [--      -0.86   1.0   --]
>>> >>  [-1.68 -0.38   --     1.0]]
>>> >>
>>> >> As you can see, the [0,3] element is -1.68, which is not a valid
>>> >> correlation coefficient. (Valid correlation coefficients should be
>>> >> between -1 and 1.)
>>> >>
>>> >> I looked at the code for np.ma.corrcoef, and this behavior seems to be
>>> >> due to the way that the mean values of the rows of the input matrix are
>>> >> computed and subtracted from them. Apparently, the mean value is
>>> >> computed individually for each row, without also masking the elements
>>> >> that are masked in the other row of the pair for which the correlation
>>> >> coefficient is being computed.
>>> >>
>>> >> I guess the right way would be to recompute the mean value of each row
>>> >> every time a correlation coefficient is computed for two rows, after
>>> >> propagating the pair-wise masked values.
>>> >>
>>> >> Please let me know what you think.
>>> >
>>> > just general comments, I have no experience here
>>> >
>>> > From what you are saying, it sounds like np.ma is not doing pairwise
>>> > deletion in calculating the mean (which only requires ignoring
>>> > missing values in one array); however, it does (correctly) do pairwise
>>> > deletion in calculating the cross product.
>>>
>>> Actually, I think the calculation of the mean is not relevant for
>>> having weird correlation coefficients without clipping.
>>>
>>> With pairwise deletion you use different samples, subsets of the data,
>>> for the variances and the covariances.
>>> It should be easy (?) to construct examples where the pairwise
>>> deletion for the covariance produces a large positive or negative
>>> number, and both variances and standard deviations are small, using
>>> two different subsamples.
>>> Once you calculate the correlation coefficient, it could be all over
>>> the place, independent of the mean calculations.
>>>
>>> conclusion: pairwise deletion requires post-processing if you want a
>>> proper correlation matrix.
>>>
>>> Josef
>>>
>>> >
>>> > covariance or correlation matrices with pairwise deletion are not
>>> > necessarily "proper" covariance or correlation matrices.
>>> > I've read that they don't need to be positive semi-definite, but I've
>>> > never heard of values outside of [-1, 1]. It might only be a problem
>>> > if you have a large fraction of missing values.
>>> >
>>> > I think the current behavior in np.ma makes sense in that it uses all
>>> > the information available in estimating the mean, which should be more
>>> > accurate if we use more information. But it makes cov and corrcoef
>>> > even weirder than they already are with pairwise deletion.
>>> >
>>> > Row-wise deletion (deleting observations that have at least one
>>> > missing value), which would create "proper" correlation matrices,
>>> > wouldn't leave much in your example.
>>> >
>>> > I would check what R or other packages are doing and follow their
>>> > lead, or add another option.
>>> > (similar: we had a case in statsmodels where I initially used all the
>>> > information for calculating the mean, but then we dropped some
>>> > observations to match the behavior of Stata, and to use the same
>>> > observations for calculating the mean and the follow-up statistics.)
>>> >
>>> >
>>> > looks like pandas might be truncating the correlations to [-1, 1] (I
>>> > didn't check)
>>> >
>>> >>>> import pandas as pd
>>> >>>> x_pd = pd.DataFrame(x_ma.T)
>>> >>>> x_pd.corr()
>>> >           0         1         2         3
>>> > 0  1.000000  0.734367 -1.000000 -0.240192
>>> > 1  0.734367  1.000000 -0.856565 -0.378777
>>> > 2 -1.000000 -0.856565  1.000000       NaN
>>> > 3 -0.240192 -0.378777       NaN  1.000000
>>> >
>>> >>>> np.round(np.ma.corrcoef(x_ma), 6)
>>> > masked_array(data =
>>> >  [[1.0 0.734367 -1.190909 -1.681346]
>>> >  [0.734367 1.0 -0.856565 -0.378777]
>>> >  [-1.190909 -0.856565 1.0 --]
>>> >  [-1.681346 -0.378777 -- 1.0]],
>>> >              mask =
>>> >  [[False False False False]
>>> >  [False False False False]
>>> >  [False False False  True]
>>> >  [False False  True False]],
>>> >        fill_value = 1e+20)
>>> >
>>> >
>>> > Josef
>>> >
>>> >
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Faraz
>>> >>
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> NumPy-Discussion mailing list
>>> >> NumPy-Discussion at scipy.org
>>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>> >>