[Numpy-discussion] invalid correlation coefficient from np.ma.corrcoef

Thu Sep 26 07:35:56 EDT 2013

On Thu, Sep 26, 2013 at 11:51 AM,  <josef.pktd at gmail.com> wrote:
> On Thu, Sep 26, 2013 at 4:21 AM, Nathaniel Smith <njs at pobox.com> wrote:
>> If you want a proper self-consistent correlation/covariance matrix, then
>> pairwise deletion just makes no sense period, I don't see how postprocessing
>> can help.
>
> clipping to [-1, 1] and finding the nearest positive semi-definite matrix.
> For the latter there is some code in statsmodels, and several newer
> algorithms that I haven't looked at.

While in general there's an interesting problem for how to best
estimate a joint covariance matrix in the presence of missing data,
this sounds way beyond the scope of lowly corrcoef(). The very fact
that there is active research on the problem means that we shouldn't
be trying to pick one algorithm and declare that it's *the* correct
one to build in as a basic tool for unsophisticated users.

It's not clear that this is even what people want corrcoef to do in
general. I always use it as just a convenient way to estimate lots of
pairwise correlations, not as a way to estimate a joint correlation
matrix...

> It's a quite common problem in finance, but usually longer time series
> with not a large number of missing values.
>
>>
>> If you want a matrix of correlations, then pairwise deletion makes sense.
>> It's an interesting point that arguably the current ma.corrcoef code may
>> actually give you a better estimator of the individual correlation
>> coefficients than doing full pairwise deletion, but it's pretty surprising
>> and unexpected... when people call corrcoef I think they are asking "please
>> compute the textbook formula for 'sample correlation'" not "please give me
>> some arbitrary good estimator for the population correlation", so we
>> probably have to change it.
>>
>> (Hopefully no-one has published anything based on the current code.)
>
> I haven't seen a textbook version of this yet.

By textbook I mean, users expect corrcoef to use this formula, which
is printed in every textbook:
  https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#For_a_sample
The vast majority of people using correlations think that "sample
correlation" justs mean this number, not "some arbitrary finite-sample
estimator of the underlying population correlation". So the obvious
interpretation of pairwise correlations is that you apply that formula
to each set of pairwise complete observations.

> Calculating every mean (n + 1) * n / 2 times sounds a bit excessive,
> especially if it doesn't really solve the problem.

I don't understand what's "excessive" about calculating every
mean/stddev (n + 1)*n/2 times. By that logic one should never call
corrcoef at all, because calculating (n + 1)*n/2 covariances is also
excessive :-). It's not like we're talking about some massive
slowdown.

>>> > looks like pandas might be truncating the correlations to [-1, 1] (I
>>> > didn't check)
>>> >
>>> >>>> import pandas as pd
>>> >>>> x_pd = pd.DataFrame(x_ma.T)
>>> >>>> x_pd.corr()
>>> >           0         1         2         3
>>> > 0  1.000000  0.734367 -1.000000 -0.240192
>>> > 1  0.734367  1.000000 -0.856565 -0.378777
>>> > 2 -1.000000 -0.856565  1.000000       NaN
>>> > 3 -0.240192 -0.378777       NaN  1.000000
>>> >
>>> >>>> np.round(np.ma.corrcoef(x_ma), 6)
>>> > masked_array(data =
>>> >  [[1.0 0.734367 -1.190909 -1.681346]
>>> >  [0.734367 1.0 -0.856565 -0.378777]
>>> >  [-1.190909 -0.856565 1.0 --]
>>> >  [-1.681346 -0.378777 -- 1.0]],

That can't be truncation -- where corrcoef has -1.68, pandas has
-0.24, not -1.0.

R gives the same result as pandas (except that by default it
propagates NA and give a range of options for handling NAs, so you
have to explicitly request pairwise complete if you want it, and they
make this the most annoying option to type ;-), and the help page
explicitly warns that this "can result in covariance or correlation
matrices which are not positive semi-definite"):

> x <- c(7, -4, -1, -7, -3, -2, 6, -3, 0, 4, 0, 5, -4, -5, 7, 5, -7, -7, -5, 5, -8, 0, 1, 4)
> x <- matrix(x, nrow=4, byrow=T)
> cor(t(x), use="pairwise.complete.obs")
           [,1]        [,2]        [,3]       [,4]
[1,]  1.0000000  0.43330535 -0.22749669 -0.5246782
[2,]  0.4333053  1.00000000 -0.01829124 -0.2120425
[3,] -0.2274967 -0.01829124  1.00000000 -0.6423032
[4,] -0.5246782 -0.21204248 -0.64230323  1.0000000

-n