On Thu, Sep 26, 2013 at 7:35 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Thu, Sep 26, 2013 at 11:51 AM, <josef.pktd@gmail.com> wrote:
On Thu, Sep 26, 2013 at 4:21 AM, Nathaniel Smith <njs@pobox.com> wrote:
If you want a proper self-consistent correlation/covariance matrix, then pairwise deletion just makes no sense period, I don't see how postprocessing can help.
Postprocessing could mean clipping to [-1, 1] and finding the nearest positive semi-definite matrix. For the latter there is some code in statsmodels, and there are several newer algorithms that I haven't looked at.
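For concreteness, here is a minimal sketch of that postprocessing using plain eigenvalue clipping. This is not the statsmodels algorithm (routines like `corr_nearest` use more refined methods, e.g. Higham-style alternating projections); it's just the simplest variant:

```python
import numpy as np

def nearest_psd_corr(corr):
    """Rough sketch: clip entries to [-1, 1], project onto the PSD cone
    by zeroing negative eigenvalues, then rescale back to unit diagonal.
    Simplest eigenvalue-clipping variant, not the statsmodels algorithm."""
    c = np.clip((corr + corr.T) / 2.0, -1.0, 1.0)  # symmetrize and clip entries
    w, v = np.linalg.eigh(c)
    w = np.clip(w, 0.0, None)                      # drop negative eigenvalues
    c = (v * w) @ v.T
    d = np.sqrt(np.diag(c))                        # rescale to unit diagonal
    c = c / np.outer(d, d)
    np.fill_diagonal(c, 1.0)
    return c
```

Note this is only a projection in the eigenvalue sense; the rescaling step means it is not exactly the nearest PSD matrix in Frobenius norm, which is why the dedicated algorithms exist.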
While in general there's an interesting problem for how to best estimate a joint covariance matrix in the presence of missing data, this sounds way beyond the scope of lowly corrcoef(). The very fact that there is active research on the problem means that we shouldn't be trying to pick one algorithm and declare that it's *the* correct one to build in as a basic tool for unsophisticated users.
It's not clear that this is even what people want corrcoef to do in general. I always use it as just a convenient way to estimate lots of pairwise correlations, not as a way to estimate a joint correlation matrix...
I don't expect that corrcoef or cov should provide a positive semidefinite approximation. I just wanted to point out what users can do with the return from corrcoef or cov if they want a proper correlation or covariance matrix. (same discussion with pandas, post-processing is a separate step.)
It's a quite common problem in finance, but usually longer time series with not a large number of missing values.
If you want a matrix of correlations, then pairwise deletion makes sense. It's an interesting point that arguably the current ma.corrcoef code may actually give you a better estimator of the individual correlation coefficients than doing full pairwise deletion, but it's pretty surprising and unexpected... when people call corrcoef I think they are asking "please compute the textbook formula for 'sample correlation'" not "please give me some arbitrary good estimator for the population correlation", so we probably have to change it.
(Hopefully no-one has published anything based on the current code.)
I haven't seen a textbook version of this yet.
By textbook I mean, users expect corrcoef to use this formula, which is printed in every textbook: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient... The vast majority of people using correlations think that "sample correlation" just means this number, not "some arbitrary finite-sample estimator of the underlying population correlation". So the obvious interpretation of pairwise correlations is that you apply that formula to each set of pairwise complete observations.
This textbook version **assumes** that we have the same observations for all/both variables, and doesn't say what to do if not. I'm usually mainly interested in the covariance/correlation matrix for estimating some underlying population or model parameters, or for doing hypothesis tests with them. I just wanted to point out that there is no "obvious" ("There should be one-- ...") way to define pairwise deletion correlation matrices. But maybe just doing a loop [corrcoef(x, y) for x in data for y in data] still makes the most sense. Dunno
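That loop, written out against a masked array, would look something like this (just a sketch of the "apply the textbook formula to each pairwise-complete set of observations" interpretation; the function name is made up):

```python
import numpy as np

def pairwise_corr(x):
    """Pairwise-deletion correlations for a 2-D np.ma.MaskedArray
    (variables in rows): for each pair of rows, apply the plain Pearson
    formula to the observations where *both* are unmasked, recomputing
    the means per pair.  Every entry then lies in [-1, 1], but the
    matrix as a whole need not be positive semi-definite."""
    data = np.ma.getdata(x).astype(float)
    mask = np.ma.getmaskarray(x)
    n = data.shape[0]
    out = np.ma.masked_all((n, n))
    for i in range(n):
        for j in range(i, n):
            ok = ~(mask[i] | mask[j])   # pairwise-complete observations
            if ok.sum() < 2:
                continue                # too few complete pairs: stays masked
            r = np.corrcoef(data[i][ok], data[j][ok])[0, 1]
            out[i, j] = out[j, i] = r
    return out
```

With no missing values this reduces to plain np.corrcoef; with missing values each entry is a genuine sample correlation, which is exactly what R's pairwise.complete.obs and pandas do.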
Calculating every mean (n + 1) * n / 2 times sounds a bit excessive, especially if it doesn't really solve the problem.
I don't understand what's "excessive" about calculating every mean/stddev (n + 1)*n/2 times. By that logic one should never call corrcoef at all, because calculating (n + 1)*n/2 covariances is also excessive :-). It's not like we're talking about some massive slowdown.
looks like pandas might be truncating the correlations to [-1, 1] (I didn't check)
>>> import pandas as pd
>>> x_pd = pd.DataFrame(x_ma.T)
>>> x_pd.corr()
          0         1         2         3
0  1.000000  0.734367 -1.000000 -0.240192
1  0.734367  1.000000 -0.856565 -0.378777
2 -1.000000 -0.856565  1.000000       NaN
3 -0.240192 -0.378777       NaN  1.000000
>>> np.round(np.ma.corrcoef(x_ma), 6)
masked_array(data =
 [[1.0 0.734367 -1.190909 -1.681346]
  [0.734367 1.0 -0.856565 -0.378777]
  [-1.190909 -0.856565 1.0 --]
  [-1.681346 -0.378777 -- 1.0]],
That can't be truncation -- where corrcoef has -1.68, pandas has -0.24, not -1.0.
my mistake, for not reading carefully enough

Josef
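To see how values like -1.19 can arise at all, here is an illustrative sketch (not the actual np.ma.corrcoef source) of what goes wrong when pairwise-complete covariances are normalized by means/stddevs taken over each variable's own full set of valid observations: the moments come from different samples, so the Cauchy-Schwarz bound no longer applies.

```python
import numpy as np

def corr_mixed_moments(x):
    """Illustration only: covariance over the pairwise-complete
    observations, but means and stddevs over each row's *own* valid
    observations.  Mixing moment sets like this can push entries
    outside [-1, 1]."""
    data = np.ma.getdata(x).astype(float)
    mask = np.ma.getmaskarray(x)
    n = data.shape[0]
    # per-variable moments, each over that row's own unmasked entries
    means = np.array([data[i][~mask[i]].mean() for i in range(n)])
    stds = np.array([data[i][~mask[i]].std() for i in range(n)])
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ok = ~(mask[i] | mask[j])   # pairwise-complete observations
            cov = ((data[i][ok] - means[i]) * (data[j][ok] - means[j])).mean()
            out[i, j] = cov / (stds[i] * stds[j])
    return out
```

If a row's masked-out-with-the-other entries sit near its mean, the per-variable stddev understates the spread on the pairwise-complete subset, and the "correlation" can exceed 1 in absolute value.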
R gives the same result as pandas (except that by default it propagates NA and gives a range of options for handling NAs, so you have to explicitly request pairwise complete if you want it, and they make this the most annoying option to type ;-), and the help page explicitly warns that this "can result in covariance or correlation matrices which are not positive semi-definite"):
x <- c(7, -4, -1, -7, -3, -2, 6, -3, 0, 4, 0, 5,
       -4, -5, 7, 5, -7, -7, -5, 5, -8, 0, 1, 4)
x <- matrix(x, nrow=4, byrow=T)
cor(t(x), use="pairwise.complete.obs")
           [,1]        [,2]        [,3]       [,4]
[1,]  1.0000000  0.43330535 -0.22749669 -0.5246782
[2,]  0.4333053  1.00000000 -0.01829124 -0.2120425
[3,] -0.2274967 -0.01829124  1.00000000 -0.6423032
[4,] -0.5246782 -0.21204248 -0.64230323  1.0000000
-n
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion