Combining covariance and correlation coefficient into one numpy.cov call

Hi all, I posted a pull request: https://github.com/numpy/numpy/pull/8211 which adds a function `numpy.covcorr` that calculates both the covariance matrix and correlation coefficient with a single call to `numpy.cov` (which is often an expensive call for large data-sets). A function `numpy.covtocorr` has also been added that converts a covariance matrix to a correlation coefficent, and `numpy.corrcoef` has been modified to call this. The motivation here is that one often needs the covariance for subsequent analysis and the correlation coefficient for visualization, so instead of forcing the user to write their own code to convert one to the other, we want to allow both to be obtained from `numpy` as efficiently as possible. Best, Mathew

I wonder if the goals of this addition could be achieved by simply adding an optional `cov` argument to np.corr, which would provide a pre-computed covariance. Either way, `covcorr` feels like a helper function that could exist in user code rather than numpy proper. On Wed, Oct 26, 2016 at 10:27 AM, Mathew S. Madhavacheril < mathewsyriac@gmail.com> wrote:

On Wed, Oct 26, 2016 at 1:46 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
I wonder if the goals of this addition could be achieved by simply adding an optional `cov` argument
to np.corr, which would provide a pre-computed covariance.
That's a fair suggestion which I'm happy to switch to. This eliminates the need for two new functions. I'll add an optional `cov = False` argument to numpy.corrcoef that returns a tuple (corr, cov) instead.
Either way, `covcorr` feels like a helper function that could exist in user code rather than numpy proper.
The user would have to re-implement the part that converts the covariance matrix to a correlation coefficient. I made this PR to avoid that code duplication. Mathew

On Wed, Oct 26, 2016 at 11:03 AM, Mathew S. Madhavacheril < mathewsyriac@gmail.com> wrote:
With the API I was envisioning (or even your proposed API, for that matter), this function would only be a few lines, e.g., def covcorr(x): cov = np.cov(x) corr = np.corrcoef(x, cov=cov) return (cov, corr) Generally, functions this short should be provided as recipes (if at all) rather than be added to numpy proper, unless the need for them is extremely common.

On Wed, Oct 26, 2016 at 2:13 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
Ah, I see what you were suggesting now. I agree that a function like covcorr need not be provided by numpy itself, but it would be tremendously useful if a pre-computed covariance could be provided to np.corrcoef. I can update this PR to just add `cov = None` to numpy.corrcoef and do an `if cov is not None` before calculating the covariance. Note however that in the case that `cov` is specified for np.corrcoef, the non-optional `x` argument is redundant.

On Wed, Oct 26, 2016 at 11:13 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
IIUC, if you have a covariance matrix then you can compute the correlation matrix directly, without looking at 'x', so corrcoef(x, cov=cov) is a bit odd-looking. I think probably the API that makes the most sense is just to expose something like the covtocorr function (maybe it could have a less telegraphic name?)? And then, yeah, users can use that to build their own covcorr or whatever if they want it. -n -- Nathaniel J. Smith -- https://vorpus.org

On Wed, Oct 26, 2016 at 2:56 PM, Nathaniel Smith <njs@pobox.com> wrote:
Right, agreed, this is why I said `x` becomes redundant when `cov` is specified when calling `numpy.corrcoef`. So we have two alternatives: 1) Have `np.corrcoef` accept a boolean optional argument `covmat = False` that lets one obtain a tuple containing the covariance and the correlation matrices in the same call 2) Modify my original PR so that `np.covtocorr` remains (with possibly a better name) but remove `np.covcorr` since this is easy for the user to add. My preference is option 2.

On Wed, Oct 26, 2016 at 3:11 PM, Mathew S. Madhavacheril < mathewsyriac@gmail.com> wrote:
cov2corr is a useful function http://www.statsmodels.org/dev/generated/statsmodels.stats.moment_helpers.co... I also wrote the inverse function corr2cov, but AFAIR use it only in some test cases. I don't think adding any of the options to corrcoef or covcor is useful since there is no computational advantage to it. What I'm missing are functions that return the intermediate results, e.g. var and mean or cov and mean. (For statsmodels I decided to return mean and cov or mean and var in the related functions. Some R packages return the mean as an option.) Josef

On Wed, Oct 26, 2016 at 3:20 PM, <josef.pktd@gmail.com> wrote:
I'm not sure I agree with that statement. If a user wants to calculate both a covariance and correlation matrix, they currently have two options: A) Call np.cov and np.corrcoef separately, which takes at least twice as long as one call to np.cov. For data-sets that I am used to, a np.cov call takes 5-10 seconds. B) Call np.cov and then separately implement their own correlation matrix code, which means the user isn't able to fully take advantage of code that is already in numpy. In any case, I've updated the PR: https://github.com/numpy/numpy/pull/8211 Relative to my original PR, it: a) removes the numpy.covcorr function which the user can easily implement b) have numpy.cov2corr be the function exposed in the API (previously called numpy.covtocorr in the PR), which accepts a pre-calculated covariance matrix c) have numpy.corrcoef call numpy.cov2corr

I wonder if the goals of this addition could be achieved by simply adding an optional `cov` argument to np.corr, which would provide a pre-computed covariance. Either way, `covcorr` feels like a helper function that could exist in user code rather than numpy proper. On Wed, Oct 26, 2016 at 10:27 AM, Mathew S. Madhavacheril < mathewsyriac@gmail.com> wrote:

On Wed, Oct 26, 2016 at 1:46 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
I wonder if the goals of this addition could be achieved by simply adding an optional `cov` argument
to np.corr, which would provide a pre-computed covariance.
That's a fair suggestion which I'm happy to switch to. This eliminates the need for two new functions. I'll add an optional `cov = False` argument to numpy.corrcoef that returns a tuple (corr, cov) instead.
Either way, `covcorr` feels like a helper function that could exist in user code rather than numpy proper.
The user would have to re-implement the part that converts the covariance matrix to a correlation coefficient. I made this PR to avoid that code duplication. Mathew

On Wed, Oct 26, 2016 at 11:03 AM, Mathew S. Madhavacheril < mathewsyriac@gmail.com> wrote:
With the API I was envisioning (or even your proposed API, for that matter), this function would only be a few lines, e.g., def covcorr(x): cov = np.cov(x) corr = np.corrcoef(x, cov=cov) return (cov, corr) Generally, functions this short should be provided as recipes (if at all) rather than be added to numpy proper, unless the need for them is extremely common.

On Wed, Oct 26, 2016 at 2:13 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
Ah, I see what you were suggesting now. I agree that a function like covcorr need not be provided by numpy itself, but it would be tremendously useful if a pre-computed covariance could be provided to np.corrcoef. I can update this PR to just add `cov = None` to numpy.corrcoef and do an `if cov is not None` before calculating the covariance. Note however that in the case that `cov` is specified for np.corrcoef, the non-optional `x` argument is redundant.

On Wed, Oct 26, 2016 at 11:13 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
IIUC, if you have a covariance matrix then you can compute the correlation matrix directly, without looking at 'x', so corrcoef(x, cov=cov) is a bit odd-looking. I think probably the API that makes the most sense is just to expose something like the covtocorr function (maybe it could have a less telegraphic name?)? And then, yeah, users can use that to build their own covcorr or whatever if they want it. -n -- Nathaniel J. Smith -- https://vorpus.org

On Wed, Oct 26, 2016 at 2:56 PM, Nathaniel Smith <njs@pobox.com> wrote:
Right, agreed, this is why I said `x` becomes redundant when `cov` is specified when calling `numpy.corrcoef`. So we have two alternatives: 1) Have `np.corrcoef` accept a boolean optional argument `covmat = False` that lets one obtain a tuple containing the covariance and the correlation matrices in the same call 2) Modify my original PR so that `np.covtocorr` remains (with possibly a better name) but remove `np.covcorr` since this is easy for the user to add. My preference is option 2.

On Wed, Oct 26, 2016 at 3:11 PM, Mathew S. Madhavacheril < mathewsyriac@gmail.com> wrote:
cov2corr is a useful function http://www.statsmodels.org/dev/generated/statsmodels.stats.moment_helpers.co... I also wrote the inverse function corr2cov, but AFAIR use it only in some test cases. I don't think adding any of the options to corrcoef or covcor is useful since there is no computational advantage to it. What I'm missing are functions that return the intermediate results, e.g. var and mean or cov and mean. (For statsmodels I decided to return mean and cov or mean and var in the related functions. Some R packages return the mean as an option.) Josef

On Wed, Oct 26, 2016 at 3:20 PM, <josef.pktd@gmail.com> wrote:
I'm not sure I agree with that statement. If a user wants to calculate both a covariance and correlation matrix, they currently have two options: A) Call np.cov and np.corrcoef separately, which takes at least twice as long as one call to np.cov. For data-sets that I am used to, a np.cov call takes 5-10 seconds. B) Call np.cov and then separately implement their own correlation matrix code, which means the user isn't able to fully take advantage of code that is already in numpy. In any case, I've updated the PR: https://github.com/numpy/numpy/pull/8211 Relative to my original PR, it: a) removes the numpy.covcorr function which the user can easily implement b) have numpy.cov2corr be the function exposed in the API (previously called numpy.covtocorr in the PR), which accepts a pre-calculated covariance matrix c) have numpy.corrcoef call numpy.cov2corr
participants (4)
-
josef.pktd@gmail.com
-
Mathew S. Madhavacheril
-
Nathaniel Smith
-
Stephan Hoyer