[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef

Thu Sep 26 12:28:54 EDT 2013

On Thu, Sep 26, 2013 at 7:35 AM, Nathaniel Smith <njs at pobox.com> wrote:
> On Thu, Sep 26, 2013 at 11:51 AM,  <josef.pktd at gmail.com> wrote:
>> On Thu, Sep 26, 2013 at 4:21 AM, Nathaniel Smith <njs at pobox.com> wrote:
>>> If you want a proper self-consistent correlation/covariance matrix, then
>>> pairwise deletion just makes no sense period, I don't see how postprocessing
>>> can help.
>>
>> clipping to [-1, 1] and finding the nearest positive semi-definite matrix.
>> For the latter there is some code in statsmodels, and several newer
>> algorithms that I haven't looked at.
>
> While in general there's an interesting problem for how to best
> estimate a joint covariance matrix in the presence of missing data,
> this sounds way beyond the scope of lowly corrcoef(). The very fact
> that there is active research on the problem means that we shouldn't
> be trying to pick one algorithm and declare that it's *the* correct
> one to build in as a basic tool for unsophisticated users.
>
> It's not clear that this is even what people want corrcoef to do in
> general. I always use it as just a convenient way to estimate lots of
> pairwise correlations, not as a way to estimate a joint correlation
> matrix...

I don't expect that corrcoef or cov should provide a positive
semidefinite approximation. I just wanted to point out what users can
do with the return from corrcoef or cov if they want a proper
correlation or covariance matrix.
(same discussion with pandas, post-processing is a separate step.)

>
>> It's a quite common problem in finance, but usually longer time series
>> with not a large number of missing values.
>>
>>>
>>> If you want a matrix of correlations, then pairwise deletion makes sense.
>>> It's an interesting point that arguably the current ma.corrcoef code may
>>> actually give you a better estimator of the individual correlation
>>> coefficients than doing full pairwise deletion, but it's pretty surprising
>>> and unexpected... when people call corrcoef I think they are asking "please
>>> compute the textbook formula for 'sample correlation'" not "please give me
>>> some arbitrary good estimator for the population correlation", so we
>>> probably have to change it.
>>>
>>> (Hopefully no-one has published anything based on the current code.)
>>
>> I haven't seen a textbook version of this yet.
>
> By textbook I mean, users expect corrcoef to use this formula, which
> is printed in every textbook:
>   https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#For_a_sample
> The vast majority of people using correlations think that "sample
> correlation" justs mean this number, not "some arbitrary finite-sample
> estimator of the underlying population correlation". So the obvious
> interpretation of pairwise correlations is that you apply that formula
> to each set of pairwise complete observations.

This textbook version **assumes** that we have the same observations
for all/both variables, and doesn't say what to do if not.
I'm usually mainly interested the covariance/correlation matrix for
estimating some underlying population or model parameters or do
hypothesis tests with them.

I just wanted to point out that there is no "obvious" ("There should
be one-- ...") way to define pairwise deletion correlation matrices.
But maybe just doing a loop [corrcoef(x, y) for x in data for y in
data] still makes the most sense.   Dunno

>
>> Calculating every mean (n + 1) * n / 2 times sounds a bit excessive,
>> especially if it doesn't really solve the problem.
>
> I don't understand what's "excessive" about calculating every
> mean/stddev (n + 1)*n/2 times. By that logic one should never call
> corrcoef at all, because calculating (n + 1)*n/2 covariances is also
> excessive :-). It's not like we're talking about some massive
> slowdown.
>
>>>> > looks like pandas might be truncating the correlations to [-1, 1] (I
>>>> > didn't check)
>>>> >
>>>> >>>> import pandas as pd
>>>> >>>> x_pd = pd.DataFrame(x_ma.T)
>>>> >>>> x_pd.corr()
>>>> >           0         1         2         3
>>>> > 0  1.000000  0.734367 -1.000000 -0.240192
>>>> > 1  0.734367  1.000000 -0.856565 -0.378777
>>>> > 2 -1.000000 -0.856565  1.000000       NaN
>>>> > 3 -0.240192 -0.378777       NaN  1.000000
>>>> >
>>>> >>>> np.round(np.ma.corrcoef(x_ma), 6)
>>>> > masked_array(data =
>>>> >  [[1.0 0.734367 -1.190909 -1.681346]
>>>> >  [0.734367 1.0 -0.856565 -0.378777]
>>>> >  [-1.190909 -0.856565 1.0 --]
>>>> >  [-1.681346 -0.378777 -- 1.0]],
>
> That can't be truncation -- where corrcoef has -1.68, pandas has
> -0.24, not -1.0.

my mistake, for not reading carefully enough

Josef

>
> R gives the same result as pandas (except that by default it
> propagates NA and give a range of options for handling NAs, so you
> have to explicitly request pairwise complete if you want it, and they
> make this the most annoying option to type ;-), and the help page
> explicitly warns that this "can result in covariance or correlation
> matrices which are not positive semi-definite"):
>
>> x <- c(7, -4, -1, -7, -3, -2, 6, -3, 0, 4, 0, 5, -4, -5, 7, 5, -7, -7, -5, 5, -8, 0, 1, 4)
>> x <- matrix(x, nrow=4, byrow=T)
>> cor(t(x), use="pairwise.complete.obs")
>            [,1]        [,2]        [,3]       [,4]
> [1,]  1.0000000  0.43330535 -0.22749669 -0.5246782
> [2,]  0.4333053  1.00000000 -0.01829124 -0.2120425
> [3,] -0.2274967 -0.01829124  1.00000000 -0.6423032
> [4,] -0.5246782 -0.21204248 -0.64230323  1.0000000
>
> -n
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion