invalid correlation coefficient from np.ma.corrcoef

Hi everyone,

I'm using np.ma.corrcoef to compute the correlation coefficients among the rows of a masked matrix, where the masked elements are the missing data. I've observed that in some cases np.ma.corrcoef gives invalid coefficients that are greater than 1 or less than -1.

Here's an example:

x = np.array([[ 7, -4, -1, -7, -3, -2],
              [ 6, -3,  0,  4,  0,  5],
              [-4, -5,  7,  5, -7, -7],
              [-5,  5, -8,  0,  1,  4]])
x_ma = np.ma.masked_less_equal(x, -5)
C = np.round(np.ma.corrcoef(x_ma), 2)
print C

[[1.0 0.73 -- -1.68]
 [0.73 1.0 -0.86 -0.38]
 [-- -0.86 1.0 --]
 [-1.68 -0.38 -- 1.0]]

As you can see, the [0, 3] element is -1.68, which is not a valid correlation coefficient (valid correlation coefficients lie between -1 and 1).

I looked at the code for np.ma.corrcoef, and this behavior seems to come from the way the mean values of the rows are computed and subtracted. The mean of each row is computed individually over that row's unmasked elements, without also masking the elements that correspond to masked elements of the other row involved in the pairwise correlation.

I think the right fix would be to propagate the pair-wise mask and recompute the mean of each row every time a correlation coefficient is computed for a pair of rows.

Please let me know what you think.

Thanks,
Faraz
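PS: Something along these lines is what I mean by propagating the pair-wise mask. This is a rough, untested sketch just to illustrate the idea; pairwise_corrcoef is only an illustrative name, not an existing numpy function.

import numpy as np

def pairwise_corrcoef(a):
    # a: 2-D masked array, rows are variables, columns are observations
    a = np.ma.asarray(a)
    n = a.shape[0]
    c = np.ma.masked_all((n, n))
    for i in range(n):
        for j in range(i, n):
            # keep only the observations that are unmasked in *both* rows
            common = ~(np.ma.getmaskarray(a[i]) | np.ma.getmaskarray(a[j]))
            if common.sum() < 2:
                continue  # too few common observations; leave the entry masked
            xi = a[i].data[common]
            xj = a[j].data[common]
            # textbook sample correlation on the common observations only
            c[i, j] = c[j, i] = np.corrcoef(xi, xj)[0, 1]
    return c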

On Wed, Sep 25, 2013 at 8:26 PM, Faraz Mirzaei <fmmirzaei@gmail.com> wrote:
> I guess the right way should be to recompute the mean value for each row every time that a correlation coefficient is being computed for two rows after propagating pair-wise masked values.
Just general comments; I have no experience here.

From what you are saying, it sounds like np.ma is not doing pairwise deletion in calculating the mean (which only requires ignoring missings in one array), but it does (correctly) do pairwise deletion in calculating the cross product.
Covariance or correlation matrices with pairwise deletion are not necessarily "proper" covariance or correlation matrices. I've read that they need not be positive semi-definite, but I've never heard of values outside of [-1, 1]. It might only be a problem if you have a large fraction of missing values.

I think the current behavior in np.ma makes sense in that it uses all the information available in estimating the mean, which should be more accurate if we use more information. But it makes cov and corrcoef even weirder than they already are with pairwise deletion.

Row-wise deletion (deleting observations that have at least one missing value), which would create "proper" correlation matrices, wouldn't leave much in your example.

I would check what R or other packages are doing and follow their lead, or add another option. (Similar case: in statsmodels I initially used all the information for calculating the mean, but then we dropped some observations to match the behavior of Stata, and to use the same observations for calculating the mean and the follow-up statistics.)

It looks like pandas might be truncating the correlations to [-1, 1] (I didn't check):
import pandas as pd
x_pd = pd.DataFrame(x_ma.T)
x_pd.corr()

          0         1         2         3
0  1.000000  0.734367 -1.000000 -0.240192
1  0.734367  1.000000 -0.856565 -0.378777
2 -1.000000 -0.856565  1.000000       NaN
3 -0.240192 -0.378777       NaN  1.000000
np.round(np.ma.corrcoef(x_ma), 6)

masked_array(data =
 [[1.0 0.734367 -1.190909 -1.681346]
  [0.734367 1.0 -0.856565 -0.378777]
  [-1.190909 -0.856565 1.0 --]
  [-1.681346 -0.378777 -- 1.0]],
             mask =
 [[False False False False]
  [False False False False]
  [False False False True]
  [False False True False]],
       fill_value = 1e+20)
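(On the row-wise deletion point above: if I read the example right, every observation (column) contains at least one masked value, so dropping incomplete observations leaves nothing:

np.ma.compress_cols(x_ma).shape
# (4, 0) -- no complete columns survive

so row-wise deletion really isn't an option here.)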
Josef

On Wed, Sep 25, 2013 at 11:05 PM, <josef.pktd@gmail.com> wrote:
> From what you are saying it sounds like np.ma is not doing pairwise deletion in calculating the mean (which only requires ignoring missings in one array), however it does (correctly) do pairwise deletion in calculating the cross product.
Actually, I think the calculation of the mean is not the key issue for getting weird correlation coefficients without clipping.

With pairwise deletion you use different samples, subsets of the data, for the variances and for the covariances. It should be easy (?) to construct examples where the pairwise-deletion covariance is a large positive or negative number while both variances, and hence both standard deviations, are small, because they come from different subsamples. Once you calculate the correlation coefficient from these pieces, it can be all over the place, independent of the mean calculations.

Conclusion: pairwise deletion requires post-processing if you want a proper correlation matrix.
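Roughly, the different pieces of a pairwise-deletion correlation come from different samples. A sketch of the quantities involved (just to illustrate the point; pairwise_r is an illustrative helper, not the actual np.ma code):

import numpy as np

def pairwise_r(a, i, j):
    # a: 2-D masked array, rows are variables, columns are observations
    mi = np.ma.getmaskarray(a[i])
    mj = np.ma.getmaskarray(a[j])
    both = ~(mi | mj)
    xi, xj = a[i].data[both], a[j].data[both]
    # covariance from the pairwise-complete observations only
    cov_ij = np.cov(xi, xj)[0, 1]
    # standard deviations from each row's own observed values
    # (generally different, larger subsamples)
    si = a[i].compressed().std(ddof=1)
    sj = a[j].compressed().std(ddof=1)
    return cov_ij / (si * sj)   # nothing forces this ratio into [-1, 1]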
Josef

If you want a proper self-consistent correlation/covariance matrix, then pairwise deletion just makes no sense, period; I don't see how post-processing can help.

If you want a matrix of correlations, then pairwise deletion makes sense. It's an interesting point that arguably the current ma.corrcoef code may actually give you a better estimator of the individual correlation coefficients than doing full pairwise deletion, but it's pretty surprising and unexpected... when people call corrcoef I think they are asking "please compute the textbook formula for 'sample correlation'", not "please give me some arbitrary good estimator for the population correlation", so we probably have to change it.

(Hopefully no-one has published anything based on the current code.)

-n

On Thu, Sep 26, 2013 at 4:21 AM, Nathaniel Smith <njs@pobox.com> wrote:
> If you want a proper self-consistent correlation/covariance matrix, then pairwise deletion just makes no sense period, I don't see how postprocessing can help.
Clipping to [-1, 1] and finding the nearest positive semi-definite matrix. For the latter there is some code in statsmodels, and several newer algorithms that I haven't looked at.

It's a quite common problem in finance, though usually with longer time series and not a large number of missing values.
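Roughly the kind of post-processing I mean, as a quick, untested numpy sketch using simple eigenvalue clipping (not the statsmodels code, and cruder than the proper nearest-correlation-matrix algorithms):

import numpy as np

def repair_corr(r, eps=1e-8):
    # Crude repair of an "improper" correlation matrix:
    # clip entries to [-1, 1], clip negative eigenvalues,
    # then rescale so the diagonal is exactly 1 again.
    r = np.clip(np.asarray(r, dtype=float), -1.0, 1.0)
    r = (r + r.T) / 2.0                    # enforce symmetry
    w, v = np.linalg.eigh(r)
    w = np.clip(w, eps, None)              # drop negative eigenvalues
    r = np.dot(v * w, v.T)                 # v * diag(w) * v.T
    d = np.sqrt(np.diag(r))
    r = r / np.outer(d, d)                 # unit diagonal again
    return (r + r.T) / 2.0

The result is positive semi-definite with ones on the diagonal, so all entries are back in [-1, 1]; it is not the *nearest* such matrix in any particular norm, just a cheap fix-up.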
> If you want a matrix of correlations, then pairwise deletion makes sense. It's an interesting point that arguably the current ma.corrcoef code may actually give you a better estimator of the individual correlation coefficients than doing full pairwise deletion, but it's pretty surprising and unexpected... when people call corrcoef I think they are asking "please compute the textbook formula for 'sample correlation'" not "please give me some arbitrary good estimator for the population correlation", so we probably have to change it.
> (Hopefully no-one has published anything based on the current code.)
I haven't seen a textbook version of this yet.

Calculating every mean (n + 1) * n / 2 times sounds a bit excessive, especially if it doesn't really solve the problem.

Josef

On Thu, Sep 26, 2013 at 6:51 AM, <josef.pktd@gmail.com> wrote:
> Calculating every mean (n + 1) * n / 2 times sounds a bit excessive, especially if it doesn't really solve the problem.
unless you also calculate each standard deviation (n + 1) * n / 2 times. But then you lose the relationship between cov and corrcoef.

Josef

On Thu, Sep 26, 2013 at 11:51 AM, <josef.pktd@gmail.com> wrote:
> clipping to [-1, 1] and finding the nearest positive semi-definite matrix. For the latter there is some code in statsmodels, and several newer algorithms that I haven't looked at.
While in general there's an interesting problem of how to best estimate a joint covariance matrix in the presence of missing data, this sounds way beyond the scope of lowly corrcoef(). The very fact that there is active research on the problem means that we shouldn't be trying to pick one algorithm and declare that it's *the* correct one to build in as a basic tool for unsophisticated users.

It's not clear that this is even what people want corrcoef to do in general. I always use it as just a convenient way to estimate lots of pairwise correlations, not as a way to estimate a joint correlation matrix...
> I haven't seen a textbook version of this yet.
By textbook I mean, users expect corrcoef to use this formula, which is printed in every textbook:
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient...

The vast majority of people using correlations think that "sample correlation" just means this number, not "some arbitrary finite-sample estimator of the underlying population correlation". So the obvious interpretation of pairwise correlations is that you apply that formula to each set of pairwise-complete observations.
> Calculating every mean (n + 1) * n / 2 times sounds a bit excessive, especially if it doesn't really solve the problem.
I don't understand what's "excessive" about calculating every mean/stddev (n + 1)*n/2 times. By that logic one should never call corrcoef at all, because calculating (n + 1)*n/2 covariances is also excessive :-). It's not like we're talking about some massive slowdown.
> looks like pandas might be truncating the correlations to [-1, 1] (I didn't check)
That can't be truncation -- where corrcoef has -1.68, pandas has -0.24, not -1.0.

R gives the same result as pandas (except that by default it propagates NA and gives a range of options for handling NAs, so you have to explicitly request pairwise-complete if you want it, and they make this the most annoying option to type ;-), and the help page explicitly warns that this "can result in covariance or correlation matrices which are not positive semi-definite"):
x <- c(7, -4, -1, -7, -3, -2,
       6, -3, 0, 4, 0, 5,
       -4, -5, 7, 5, -7, -7,
       -5, 5, -8, 0, 1, 4)
x <- matrix(x, nrow=4, byrow=T)
cor(t(x), use="pairwise.complete.obs")

           [,1]        [,2]        [,3]       [,4]
[1,]  1.0000000  0.43330535 -0.22749669 -0.5246782
[2,]  0.4333053  1.00000000 -0.01829124 -0.2120425
[3,] -0.2274967 -0.01829124  1.00000000 -0.6423032
[4,] -0.5246782 -0.21204248 -0.64230323  1.0000000
-n

On Thu, Sep 26, 2013 at 7:35 AM, Nathaniel Smith <njs@pobox.com> wrote:
> While in general there's an interesting problem for how to best estimate a joint covariance matrix in the presence of missing data, this sounds way beyond the scope of lowly corrcoef(). The very fact that there is active research on the problem means that we shouldn't be trying to pick one algorithm and declare that it's *the* correct one to build in as a basic tool for unsophisticated users.
> It's not clear that this is even what people want corrcoef to do in general. I always use it as just a convenient way to estimate lots of pairwise correlations, not as a way to estimate a joint correlation matrix...
I don't expect that corrcoef or cov should provide a positive semi-definite approximation. I just wanted to point out what users can do with the return from corrcoef or cov if they want a proper correlation or covariance matrix. (Same discussion as with pandas: post-processing is a separate step.)
> By textbook I mean, users expect corrcoef to use this formula, which is printed in every textbook: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient... The vast majority of people using correlations think that "sample correlation" just means this number, not "some arbitrary finite-sample estimator of the underlying population correlation". So the obvious interpretation of pairwise correlations is that you apply that formula to each set of pairwise complete observations.
This textbook version **assumes** that we have the same observations for all/both variables, and doesn't say what to do if not. I'm usually mainly interested in the covariance/correlation matrix for estimating some underlying population or model parameters, or for doing hypothesis tests with them.

I just wanted to point out that there is no "obvious" ("There should be one-- ...") way to define pairwise-deletion correlation matrices.

But maybe just doing a loop, [corrcoef(x, y) for x in data for y in data], still makes the most sense. Dunno.
> That can't be truncation -- where corrcoef has -1.68, pandas has -0.24, not -1.0.
My mistake, for not reading carefully enough.

Josef

On 26 Sep 2013 17:32, <josef.pktd@gmail.com> wrote:
> I just wanted to point out that there is no "obvious" ("There should be one-- ...") way to define pairwise deletion correlation matrices.
Yeah, fair enough.
> But maybe just doing a loop [corrcoef(x, y) for x in data for y in data] still makes the most sense. Dunno
I'm not 100% sure what the best answer is either, but it seems we agree that these are the only reasonable options:
(1) refuse to give correlations if there are missing values
(2) the pairwise version pandas/R do
(3) maybe something in between (like only including fully complete rows, or giving an option to pick between these)

But the key thing here is that the current behaviour is definitely *wrong* and misleading people, so we had better do something about that. (And if no one pops up to fix it, maybe we should just remove the function entirely from 1.8, because numerically wrong answers are Serious Business?)

-n
participants (3)
- Faraz Mirzaei
- josef.pktd@gmail.com
- Nathaniel Smith