np.corrcoef ddof is redundant?
Hi,

I'm trying to calculate correlation coefficients and looking at the np.corrcoef function. It has bias and ddof arguments, however when I try different values of ddof with test data the results are always the same, i.e., changing ddof has no effect. From some back-of-the-envelope algebra I reckon the n/(n-ddof) normalisations should get cancelled out when calculating correlation coefficients from a covariance matrix, and therefore the ddof (and bias) arguments to np.corrcoef are redundant.

I'd be very grateful if someone could verify this is true or tell me if I've missed something.

Thanks,
Alistair

--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Web: http://purl.org/net/aliman
Email: alimanfoo@gmail.com
Tel: +44 (0)1865 287721
It does change for me, though very little:

    import numpy as np

    x = np.random.randn(50)
    y = x * x * x * x
    for ddof in range(20):
        # Print r to 20 decimal places so that last-digit differences show up
        print("ddof = {}; r = {:.20f}".format(ddof, np.corrcoef(x, y, ddof=ddof)[0, 1]))

    ddof = 0; r = 0.27115960925626320099
    ddof = 1; r = 0.27115960925626320099
    ddof = 2; r = 0.27115960925626314548
    ddof = 3; r = 0.27115960925626320099
    ddof = 4; r = 0.27115960925626320099
    ddof = 5; r = 0.27115960925626314548
    ddof = 6; r = 0.27115960925626320099
    ddof = 7; r = 0.27115960925626320099
    ddof = 8; r = 0.27115960925626320099
    ddof = 9; r = 0.27115960925626320099
    ddof = 10; r = 0.27115960925626314548
    ddof = 11; r = 0.27115960925626320099
    ddof = 12; r = 0.27115960925626320099
    ddof = 13; r = 0.27115960925626320099
    ddof = 14; r = 0.27115960925626314548
    ddof = 15; r = 0.27115960925626314548
    ddof = 16; r = 0.27115960925626314548
    ddof = 17; r = 0.27115960925626320099
    ddof = 18; r = 0.27115960925626320099
    ddof = 19; r = 0.27115960925626320099

Cheers

2015-03-10 11:55 GMT-04:00 Alistair Miles <alimanfoo@googlemail.com>:
> It has bias and ddof arguments, however when I try different values of ddof with test data the results are always the same, i.e., changing ddof has no effect.
_______________________________________________ SciPy-User mailing list SciPy-User@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user
-- Sasha
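The values printed above differ only in the last digit or two of a double, which looks like floating-point rounding rather than a real dependence on ddof. Below is a minimal sketch of one way to check that, assuming similar test data. It builds the correlation from np.cov by hand instead of passing ddof to np.corrcoef. The helper name corr_from_cov and the seed are made up for this illustration.

```python
import numpy as np


def corr_from_cov(x, y, ddof):
    """Pearson r computed from a covariance matrix normalised by n - ddof.

    Hypothetical helper for illustration; not part of NumPy.
    """
    c = np.cov(x, y, ddof=ddof)
    return c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])


rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.standard_normal(50)
y = x ** 4

r = np.array([corr_from_cov(x, y, ddof) for ddof in range(20)])

# Spread between the largest and smallest r, measured in units of the
# floating-point spacing (ULP) at that magnitude.
spread_in_ulps = (r.max() - r.min()) / np.spacing(np.abs(r[0]))
print(spread_in_ulps)
```

A spread of a few ULPs would be consistent with the normalisation cancelling exactly in the algebra and only the rounding of intermediate results differing.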
Alistair Miles <alimanfoo@googlemail.com> wrote:
> From some back-of-the-envelope algebra I reckon the n/(n-ddof) normalisations should get cancelled out when calculating correlation coefficients from a covariance matrix, and therefore the ddof (and bias) arguments to np.corrcoef are redundant.
> I'd be very grateful if someone could verify this is true or tell me if I've missed something.

You are right. It should cancel out or np.corrcoef would be wrong. The sample size does not go into the Pearson product-moment correlation.

Sturla
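The cancellation can be written out explicitly. The notation here is the standard textbook one rather than anything taken from the thread: n is the sample size, the covariance estimate is normalised by n - ddof, and n - ddof is assumed to be positive.

```latex
\[
C_{xy} = \frac{1}{n - \mathrm{ddof}} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\]
\[
r_{xy} = \frac{C_{xy}}{\sqrt{C_{xx}\, C_{yy}}}
= \frac{\dfrac{1}{n - \mathrm{ddof}} \sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
       {\dfrac{1}{n - \mathrm{ddof}} \sqrt{\sum_{i} (x_i - \bar{x})^2 \, \sum_{i} (y_i - \bar{y})^2}}
= \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
       {\sqrt{\sum_{i} (x_i - \bar{x})^2}\, \sqrt{\sum_{i} (y_i - \bar{y})^2}} .
\]
```

The factor 1/(n - ddof) appears once in the numerator and once in the denominator, so r does not depend on ddof.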
Hi,

On Tue, Mar 10, 2015 at 9:27 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
> You are right. It should cancel out or np.corrcoef would be wrong. The sample size does not go into the Pearson product-moment correlation.

Oh dear - that's embarrassing.

https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

I guess we should deprecate the 'bias' and 'ddof' input arguments asap.

Cheers,
Matthew
Thanks for the responses, glad to know I'm not going crazy.

Cheers,
Alistair
On 10/03/15 21:12, Matthew Brett wrote:
> I guess we should deprecate the 'bias' and 'ddof' input arguments asap.

It is an unfortunate consequence of implementing np.corrcoef on top of np.cov.

np.corrcoef should not be computed with np.cov because it just adds additional rounding error to the result.

https://github.com/numpy/numpy/blob/32e23a1d52a05d3a56f693010eaf8d96826db75f...

Sturla
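One way to read the suggestion above is to compute the correlation straight from the centred data, so that no n - ddof factor is ever applied and then divided out again. The following is only a sketch under that reading, not the code NumPy uses. The name corrcoef_direct is hypothetical, and it assumes a real 2-D array with variables in rows and no constant rows.

```python
import numpy as np


def corrcoef_direct(X):
    """Pearson correlation matrix of the rows of X, without forming a covariance.

    Hypothetical helper for illustration; assumes a real 2-D array whose rows
    are variables (the rowvar=True layout) and that no row is constant.
    """
    X = np.asarray(X, dtype=float)
    # Centre each variable on its mean.
    centred = X - X.mean(axis=1, keepdims=True)
    # Scale each centred row to unit Euclidean norm; no n - ddof factor appears.
    unit = centred / np.linalg.norm(centred, axis=1, keepdims=True)
    # Dot products of unit-norm centred rows are the correlation coefficients.
    return unit @ unit.T


rng = np.random.default_rng(1)  # arbitrary seed
data = rng.standard_normal((3, 50))
print(np.allclose(corrcoef_direct(data), np.corrcoef(data)))
```

Whether this is measurably more accurate than going through np.cov is not established in the thread; the sketch only shows that the intermediate normalisation is unnecessary.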
On Tue, Mar 10, 2015 at 7:21 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
> It is an unfortunate consequence of implementing np.corrcoef on top of np.cov.

Except we should have realized that bias / ddof cancel and therefore should not have implemented the bias / ddof input arguments (or passed them to cov in the function).

> np.corrcoef should not be computed with np.cov because it just adds additional rounding error to the result.

What algorithm do you think we should use to minimize rounding error?

Cheers,
Matthew
On 11/03/15 03:56, Matthew Brett wrote:
> What algorithm do you think we should use to minimize rounding error?

I was not actually thinking about that. I just thought we could reuse some of the code from np.cov to avoid the redundant division and multiplications.

But since you asked, to minimize rounding error there is a two-pass method which can be used for both cov and corrcoef. Cf. this Matlab code:

http://home.online.no/~pjacklam/matlab/software/util/statutil/covmat.m

This would be very easy to use in NumPy.

Another method which is less known is to use the SVD. It can also be used to compute the corrcoef. Here for real values and rowvar=False:

    import numpy as np

    def cov(X, ddof):
        # X has shape (nx, p): nx observations (rows) of p variables (columns).
        # As written, this assumes nx >= p.
        nx, p = X.shape
        mean = X.mean(axis=0)
        CX = X - mean[None, :]  # centre each column
        u, s, pc = np.linalg.svd(CX / np.sqrt(nx - ddof), full_matrices=False)
        s2 = s**2  # squared singular values
        tmp = np.eye(p) * s2[:, None]  # diagonal matrix of s2
        return np.dot(pc.T, np.dot(tmp, pc))

Sturla
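The message says the SVD can also be used for the corrcoef but only shows the covariance. Here is a minimal sketch of how that might look, again for real values and rowvar=False. The function name corrcoef_svd is hypothetical, the normalisation step is my own reading rather than something spelled out in the thread, and it assumes no column is constant.

```python
import numpy as np


def corrcoef_svd(X):
    """Correlation matrix of the columns of X via the SVD of the centred data.

    Hypothetical helper for illustration; assumes real values, observations in
    rows (rowvar=False) and no constant columns.
    """
    X = np.asarray(X, dtype=float)
    CX = X - X.mean(axis=0)
    # CX = u @ diag(s) @ pc, so CX.T @ CX = A @ A.T with A = pc.T * s.
    _, s, pc = np.linalg.svd(CX, full_matrices=False)
    A = pc.T * s
    # Scaling each row of A to unit length divides the implied covariance by
    # the standard deviations; a 1 / (nx - ddof) factor would cancel here.
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    return A @ A.T


rng = np.random.default_rng(2)  # arbitrary seed
X = rng.standard_normal((50, 4))
print(np.allclose(corrcoef_svd(X), np.corrcoef(X, rowvar=False)))
```

Because the rows of A are normalised, ddof never needs to be passed in, which matches the point made earlier in the thread that the sample size does not enter the correlation.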
Hi,

On Tue, Mar 10, 2015 at 1:12 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
> I guess we should deprecate the 'bias' and 'ddof' input arguments asap.

https://github.com/numpy/numpy/pull/5675

Cheers,
Matthew
participants (4)
- Alistair Miles
- Matthew Brett
- Oleksandr Huziy
- Sturla Molden