Re: [Numpy-discussion] Cross-covariance function
Thank you Sturla, that's exactly what I want. I'm sorry that I was not able to reply for so long, but Pierre's code is similar to what I have already implemented, and I am in support of changing the functionality of cov(). I am unaware of any arguments for a covariance function that works in this way, except for the fact that the MATLAB cov() function behaves in the same way. [1] MATLAB, however, has an xcov() function, which is similar to what we have been discussing. [2] Unless you all wish to retain compatibility with MATLAB, I feel that the behaviour of cov() suggested by Pierre is the most straightforward method, and that if users wish to calculate the covariance of X concatenated with Y, then they may simply concatenate the matrices explicitly before passing into cov(), as this way the default method does not use 75% more CPU time. Again, if there is an argument for this functionality, I would love to learn of it! -E [1] http://www.mathworks.com/help//techdoc/ref/cov.html [2] http://www.mathworks.com/help/toolbox/signal/ref/xcov.html
I ran into this a while ago and was confused why cov did not behave the way pierre suggested. On Jan 21, 2012 12:48 PM, "Elliot Saba" <staticfloat@gmail.com> wrote:
Thank you Sturla, that's exactly what I want.
I'm sorry that I was not able to reply for so long, but Pierre's code is similar to what I have already implemented, and I am in support of changing the functionality of cov(). I am unaware of any arguments for a covariance function that works in this way, except for the fact that the MATLAB cov() function behaves in the same way. [1]
MATLAB, however, has an xcov() function, which is similar to what we have been discussing. [2]
Unless you all wish to retain compatibility with MATLAB, I feel that the behaviour of cov() suggested by Pierre is the most straightforward method, and that if users wish to calculate the covariance of X concatenated with Y, then they may simply concatenate the matrices explicitly before passing into cov(), as this way the default method does not use 75% more CPU time.
Again, if there is an argument for this functionality, I would love to learn of it! -E
[1] http://www.mathworks.com/help//techdoc/ref/cov.html [2] http://www.mathworks.com/help/toolbox/signal/ref/xcov.html
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Sat, Jan 21, 2012 at 6:26 PM, John Salvatier <jsalvati@u.washington.edu> wrote:
I ran into this a while ago and was confused why cov did not behave the way pierre suggested.
same here, When I rewrote scipy.stats.spearmanr, I matched the numpy behavior for two arrays, while R only returns the cross-correlation part. Josef
On Jan 21, 2012 12:48 PM, "Elliot Saba" <staticfloat@gmail.com> wrote:
Thank you Sturla, that's exactly what I want.
I'm sorry that I was not able to reply for so long, but Pierre's code is similar to what I have already implemented, and I am in support of changing the functionality of cov(). I am unaware of any arguments for a covariance function that works in this way, except for the fact that the MATLAB cov() function behaves in the same way. [1]
MATLAB, however, has an xcov() function, which is similar to what we have been discussing. [2]
Unless you all wish to retain compatibility with MATLAB, I feel that the behaviour of cov() suggested by Pierre is the most straightforward method, and that if users wish to calculate the covariance of X concatenated with Y, then they may simply concatenate the matrices explicitly before passing into cov(), as this way the default method does not use 75% more CPU time.
Again, if there is an argument for this functionality, I would love to learn of it! -E
[1] http://www.mathworks.com/help//techdoc/ref/cov.html [2] http://www.mathworks.com/help/toolbox/signal/ref/xcov.html
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Le 22/01/2012 01:40, josef.pktd@gmail.com a écrit :
same here, When I rewrote scipy.stats.spearmanr, I matched the numpy behavior for two arrays, while R only returns the cross-correlation part. Since I've seen no negative feedback, I jumped to the next step by creating a Trac account and posting a new ticket :
http://projects.scipy.org/numpy/ticket/2031 If people feel ok with this proposal, I can try to expand the proposed implementation skeleton to something more serious. But maybe Elliot has already something ready to pull-request on GitHub ? Pierre
On Thu, Jan 26, 2012 at 7:19 AM, Pierre Haessig <pierre.haessig@crans.org> wrote:
Le 22/01/2012 01:40, josef.pktd@gmail.com a écrit :
same here, When I rewrote scipy.stats.spearmanr, I matched the numpy behavior for two arrays, while R only returns the cross-correlation part. Since I've seen no negative feedback, I jumped to the next step by creating a Trac account and posting a new ticket :
http://projects.scipy.org/numpy/ticket/2031
If people feel ok with this proposal, I can try to expand the proposed implementation skeleton to something more serious. But maybe Elliot has already something ready to pull-request on GitHub ?
Pierre _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Really I do not understand what you want to do especially when the ticket contains some very basic errors. Can you please provide a couple of real examples with expected output that clearly show what you want?
From a statistical viewpoint, np.cov is correct because it outputs the variance/covariance matrix. Also I believe that changing the np.cov function will cause major havoc because numpy and people's code depend on the current behavior.
Bruce
26.01.2012 15:57, Bruce Southey kirjoitti: [clip]
Also I believe that changing the np.cov function will cause major havoc because numpy and people's code depend on the current behavior.
Changing the behavior of `cov` is IMHO not really possible at this point --- the current behavior is not a bug, but a documented feature that has been around probably already since Numeric. However, adding a new function could be possible. -- Pauli Virtanen
Le 26/01/2012 16:50, Pauli Virtanen a écrit :
the current behavior is not a bug, I completely agree that numpy.cov(m,y) does what it says !
I (and apparently some other people) are only questioning why there is such a behavior ? Indeed, the second variable `y` is presented as "An additional set of variables and observations". This raises for me two different questions : * What is the use case for such an additional set of variables that could just be concatenated to the first set `̀m` ? * Or, if indeed this sort of integrated concatenation is useful, why just add one "additional set" and not several "additional sets" like :
cov(m, y1, y2, y3, ....) ?
But I would understand that numpy responsibility to provide a stable computing API would prevent any change in cov behavior. You have the long term experience to judge that. (I certainly don't ;-) ) However, in the case this change is not possible, I would see this solution : * add and xcov function that does what Elliot and Sturla and I described, because * possibly deprecate the `y` 2nd argument of cov because I feel it brings more definition complication than real programming benefits (but I still find that changing cov would lead to a leaner numpy API which was my motivation for reacting to Elliot's first message) Pierre
Den 26.01.2012 17:25, skrev Pierre Haessig:
However, in the case this change is not possible, I would see this solution : * add and xcov function that does what Elliot and Sturla and I described, because
The current np.cov implementation returns the cross-covariance the way it is commonly used in statistics. If MATLAB does not, then that is MATLAB's problem I think. http://www.stat.washington.edu/research/reports/2001/tr391.pdf Sturla
On Thu, Jan 26, 2012 at 12:26 PM, Sturla Molden <sturla@molden.no> wrote:
Den 26.01.2012 17:25, skrev Pierre Haessig:
However, in the case this change is not possible, I would see this solution : * add and xcov function that does what Elliot and Sturla and I described, because
The current np.cov implementation returns the cross-covariance the way it is commonly used in statistics. If MATLAB does not, then that is MATLAB's problem I think.
The discussion had this reversed, numpy matches the behavior of MATLAB, while R (statistics) only returns the cross covariance part as proposed. If there is a new xcov, then I think there should also be a xcorrcoef. This case needs a different implementation than corrcoef, since the xcov doesn't contain the variances and they need to be calculated separately. Josef
http://www.stat.washington.edu/research/reports/2001/tr391.pdf
Sturla
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Le 26/01/2012 19:19, josef.pktd@gmail.com a écrit :
The discussion had this reversed, numpy matches the behavior of MATLAB, while R (statistics) only returns the cross covariance part as proposed.
I would also say that there was an attempt to match MATLAB behavior. However, there is big difference with numpy.cov because of the default value `rowvar` being True. Most softwares and textbooks I know consider that, in a 2D context, matrix rows are obvervations while columns are the variables. Any idea why the "transposed" convention was selected in np.cov ? (This question, I'm raising for informative purpose only... ;-) ) I also compared with octave to see how it works : -- Function File: cov (X, Y) Compute covariance. If each row of X and Y is an observation and each column is a variable, the (I, J)-th entry of `cov (X, Y)' is the covariance between the I-th variable in X and the J-th variable in Y. If called with one argument, compute `cov (X, X)'. (http://www.gnu.org/software/octave/doc/interpreter/Correlation-and-Regressio...) I like the clear tone of this description. But strangely enough, this a bit different from Matlab. (http://webcache.googleusercontent.com/search?q=cache:L3kF8BHcB4EJ:octave.1599824.n4.nabble.com/cov-m-function-behaves-different-from-Matlab-td1634956.html+&cd=1&hl=fr&ct=clnk&client=iceweasel-a)
If there is a new xcov, then I think there should also be a xcorrcoef. This case needs a different implementation than corrcoef, since the xcov doesn't contain the variances and they need to be calculated separately. Adding xcorrcoeff as well would make sense. The use of the np.var when setting the `axis` and `̀̀ddof` arguments to appropriate values should the bring variances needed for the normalization.
In the end, if adding xcov is the path of least resistance, this may be the way to go. What do people think ? Pierre
On Friday, January 27, 2012, Pierre Haessig <pierre.haessig@crans.org> wrote:
Le 26/01/2012 19:19, josef.pktd@gmail.com a écrit :
The discussion had this reversed, numpy matches the behavior of MATLAB, while R (statistics) only returns the cross covariance part as proposed.
I would also say that there was an attempt to match MATLAB behavior. However, there is big difference with numpy.cov because of the default value `rowvar` being True. Most softwares and textbooks I know consider that, in a 2D context, matrix rows are obvervations while columns are the variables.
Any idea why the "transposed" convention was selected in np.cov ? (This question, I'm raising for informative purpose only... ;-) )
I also compared with octave to see how it works : -- Function File: cov (X, Y) Compute covariance.
If each row of X and Y is an observation and each column is a variable, the (I, J)-th entry of `cov (X, Y)' is the covariance between the I-th variable in X and the J-th variable in Y. If called with one argument, compute `cov (X, X)'.
( http://www.gnu.org/software/octave/doc/interpreter/Correlation-and-Regressio... ) I like the clear tone of this description. But strangely enough, this a bit different from Matlab. ( http://webcache.googleusercontent.com/search?q=cache:L3kF8BHcB4EJ:octave.1599824.n4.nabble.com/cov-m-function-behaves-different-from-Matlab-td1634956.html+&cd=1&hl=fr&ct=clnk&client=iceweasel-a )
If there is a new xcov, then I think there should also be a xcorrcoef. This case needs a different implementation than corrcoef, since the xcov doesn't contain the variances and they need to be calculated separately. Adding xcorrcoeff as well would make sense. The use of the np.var when setting the `axis` and `̀̀ddof` arguments to appropriate values should the bring variances needed for the normalization.
In the end, if adding xcov is the path of least resistance, this may be the way to go. What do people think ?
Pierre
My vote is for xcov() and xcorrcoeff(). It won't break compatibility, and the name of the function makes it clear what it does. It would also make sense to add "seealso" references to each other in the docstrings. The documentation for xcov() should also make it clear the differences between cov() and xcov() with examples and show how to get equivalent results using just cov() for those with older versions of numpy. Ben Root
On 01/27/2012 09:00 AM, Benjamin Root wrote:
On Friday, January 27, 2012, Pierre Haessig <pierre.haessig@crans.org <mailto:pierre.haessig@crans.org>> wrote:
Le 26/01/2012 19:19, josef.pktd@gmail.com <mailto:josef.pktd@gmail.com> a écrit :
The discussion had this reversed, numpy matches the behavior of MATLAB, while R (statistics) only returns the cross covariance part as proposed.
I would also say that there was an attempt to match MATLAB behavior. However, there is big difference with numpy.cov because of the default value `rowvar` being True. Most softwares and textbooks I know consider that, in a 2D context, matrix rows are obvervations while columns are the variables.
Any idea why the "transposed" convention was selected in np.cov ? (This question, I'm raising for informative purpose only... ;-) )
I also compared with octave to see how it works : -- Function File: cov (X, Y) Compute covariance.
If each row of X and Y is an observation and each column is a variable, the (I, J)-th entry of `cov (X, Y)' is the covariance between the I-th variable in X and the J-th variable in Y. If called with one argument, compute `cov (X, X)'.
(http://www.gnu.org/software/octave/doc/interpreter/Correlation-and-Regressio...)
I like the clear tone of this description. But strangely enough, this a bit different from Matlab.
(http://webcache.googleusercontent.com/search?q=cache:L3kF8BHcB4EJ:octave.1599824.n4.nabble.com/cov-m-function-behaves-different-from-Matlab-td1634956.html+&cd=1&hl=fr&ct=clnk&client=iceweasel-a <http://webcache.googleusercontent.com/search?q=cache:L3kF8BHcB4EJ:octave.1599824.n4.nabble.com/cov-m-function-behaves-different-from-Matlab-td1634956.html+&cd=1&hl=fr&ct=clnk&client=iceweasel-a>)
If there is a new xcov, then I think there should also be a xcorrcoef. This case needs a different implementation than corrcoef, since the xcov doesn't contain the variances and they need to be calculated separately. Adding xcorrcoeff as well would make sense. The use of the np.var when setting the `axis` and `??ddof` arguments to appropriate values
should the
bring variances needed for the normalization.
In the end, if adding xcov is the path of least resistance, this may be the way to go. What do people think ?
Pierre
My vote is for xcov() and xcorrcoeff(). It won't break compatibility, and the name of the function makes it clear what it does. It would also make sense to add "seealso" references to each other in the docstrings. The documentation for xcov() should also make it clear the differences between cov() and xcov() with examples and show how to get equivalent results using just cov() for those with older versions of numpy.
Ben Root
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
-1 because these are too close to cross-correlation as used by signal processing. The output is still a covariance so do we really need yet another set of very similar functions to maintain? Or can we get away with a new keyword? If speed really matters to you guys then surely moving np.cov into C would have more impact on 'saving the world' than this proposal. That also ignores algorithm used ( http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance). Actually np.cov also is deficient in that it does not have the dtype argument so it is prone to numerical precision errors (especially getting the mean of the array). Probably should be a ticket... Bruce
Hi Bruce, Sorry for the delay in the answer. Le 27/01/2012 17:28, Bruce Southey a écrit :
The output is still a covariance so do we really need yet another set of very similar functions to maintain? Or can we get away with a new keyword?
The idea of an additional keyword seems appealing. Just to make sure I understood it well, you woud be proposing a new signature like : def cov(.... get_full_cov_matrix=True) and when `get_full_cov_matrix` is set to False, only the cross covariance part would be returned. Am I right ?
If speed really matters to you guys then surely moving np.cov into C would have more impact on 'saving the world' than this proposal. That also ignores algorithm used ( http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance).
Actually np.cov also is deficient in that it does not have the dtype argument so it is prone to numerical precision errors (especially getting the mean of the array). Probably should be a ticket... I'm not a specialist of numerical precisions, but I got very impressed by the recent example raised on Jan 24th by Michael Aye which was one of
I didn't get your point about the algorithm here. From this nomenclature, I would say that numpy.cov is based on a vectorized "two-pass algorithm" which computes the means first and then substracts it before computing the matrix product. Would you make it different ? the first "real life" example I've seen. The way I see the cov algorithm, I see first a possibility to propagate an optional dtype argument to the mean computation. However, I'm unsure about what to do after, for the matrix product since "dot(X.T, X.conj()) / fact" is also a sort of mean computation. Therefore it can also be affected by numerical precision issue. What would you suggest ? (the only solution I see would be to use the running variance algorithm. Since the code wouldn't be vectorized anymore, this indeed would benefits from going to C) Best, Pierre
Le 26/01/2012 15:57, Bruce Southey a écrit :
Can you please provide a couple of real examples with expected output that clearly show what you want?
Hi Bruce, Thanks for your ticket feedback ! It's precisely because I see a big potential impact of the proposed change that I send first a ML message, second a ticket before jumping to a pull-request like a Sergio Leone's cowboy (sorry, I watched "for a few dollars more" last weekend...) Now, I realize that in the ticket writing I made the wrong trade-off between conciseness and accuracy which led to some of the errors you raised. Let me try to use your example to try to share what I have in mind.
X = array([-2.1, -1. , 4.3]) Y = array([ 3. , 1.1 , 0.12])
Indeed, with today's cov behavior we have a 2x2 array:
cov(X,Y) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Now, when I used the word 'concatenation', I wasn't precise enough because I meant assembling X and Y in the sense of 2 vectors of observations from 2 random variables X and Y. This is achieved by concatenate(X,Y) *when properly playing with dimensions* (which I didn't mentioned) :
XY = np.concatenate((X[None, :], Y[None, :])) array([[-2.1 , -1. , 4.3 ], [ 3. , 1.1 , 0.12]])
In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
np.cov(XY) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
(And indeed, the actual cov Python code does use concatenate() ) Now let me come back to my assertion about this behavior *usefulness*. You'll acknowledge that np.cov(XY) is made of four blocks (here just 4 simple scalars blocks). * diagonal blocks are just cov(X) and cov(Y) (which in this case comes to var(X) and var(Y) when setting ddof to 1) * off diagonal blocks are symetric and are actually the covariance estimate of X, Y observations (from http://en.wikipedia.org/wiki/Covariance) that is :
((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1) -4.2860000000000005
The new proposed behaviour for cov is that cov(X,Y) would return : array(-4.2860000000000005) instead of the 2*2 matrix. * This would be in line with the cov(X,Y) mathematical definition, as well as with R behavior. * This would save memory and computing resources. (and therefore help save the planet ;-) ) However, I do understand that the impact for this change may be big. This indeed requires careful reviewing. Pierre
On Thu, Jan 26, 2012 at 10:07 AM, Pierre Haessig <pierre.haessig@crans.org> wrote:
Le 26/01/2012 15:57, Bruce Southey a écrit :
Can you please provide a couple of real examples with expected output that clearly show what you want?
Hi Bruce,
Thanks for your ticket feedback ! It's precisely because I see a big potential impact of the proposed change that I send first a ML message, second a ticket before jumping to a pull-request like a Sergio Leone's cowboy (sorry, I watched "for a few dollars more" last weekend...)
Now, I realize that in the ticket writing I made the wrong trade-off between conciseness and accuracy which led to some of the errors you raised. Let me try to use your example to try to share what I have in mind.
X = array([-2.1, -1. , 4.3]) Y = array([ 3. , 1.1 , 0.12])
Indeed, with today's cov behavior we have a 2x2 array:
cov(X,Y) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Now, when I used the word 'concatenation', I wasn't precise enough because I meant assembling X and Y in the sense of 2 vectors of observations from 2 random variables X and Y. This is achieved by concatenate(X,Y) *when properly playing with dimensions* (which I didn't mentioned) :
XY = np.concatenate((X[None, :], Y[None, :])) array([[-2.1 , -1. , 4.3 ], [ 3. , 1.1 , 0.12]])
In this context, I find stacking, np.vstack((X,Y)), more appropriate than concatenate.
In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
np.cov(XY) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Sure the resulting array is the same but whole process is totally different.
(And indeed, the actual cov Python code does use concatenate() ) Yes, but the user does not see that. Whereas you are forcing the user to do the stacking in the correct dimensions.
Now let me come back to my assertion about this behavior *usefulness*. You'll acknowledge that np.cov(XY) is made of four blocks (here just 4 simple scalars blocks).
No there are not '4' blocks just rows and columns.
* diagonal blocks are just cov(X) and cov(Y) (which in this case comes to var(X) and var(Y) when setting ddof to 1) Sure but variances are still covariances.
* off diagonal blocks are symetric and are actually the covariance estimate of X, Y observations (from http://en.wikipedia.org/wiki/Covariance) Sure
that is :
((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1) -4.2860000000000005
The new proposed behaviour for cov is that cov(X,Y) would return : array(-4.2860000000000005) instead of the 2*2 matrix.
But how you interpret an 2D array where the rows are greater than 2?
Z=Y+X np.cov(np.vstack((X,Y,Z))) array([[ 11.71 , -4.286 , 7.424 ], [ -4.286 , 2.14413333, -2.14186667], [ 7.424 , -2.14186667, 5.28213333]])
* This would be in line with the cov(X,Y) mathematical definition, as well as with R behavior.
I don't care what R does because I am using Python and Python is infinitely better than R is! But I think that is only in the 1D case.
* This would save memory and computing resources. (and therefore help save the planet ;-) ) Nothing that you have provided shows that it will.
However, I do understand that the impact for this change may be big. This indeed requires careful reviewing.
Pierre _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Bruce
On Thu, Jan 26, 2012 at 1:25 PM, Bruce Southey <bsouthey@gmail.com> wrote:
On Thu, Jan 26, 2012 at 10:07 AM, Pierre Haessig <pierre.haessig@crans.org> wrote:
Le 26/01/2012 15:57, Bruce Southey a écrit :
Can you please provide a couple of real examples with expected output that clearly show what you want?
Hi Bruce,
Thanks for your ticket feedback ! It's precisely because I see a big potential impact of the proposed change that I send first a ML message, second a ticket before jumping to a pull-request like a Sergio Leone's cowboy (sorry, I watched "for a few dollars more" last weekend...)
Now, I realize that in the ticket writing I made the wrong trade-off between conciseness and accuracy which led to some of the errors you raised. Let me try to use your example to try to share what I have in mind.
X = array([-2.1, -1. , 4.3]) Y = array([ 3. , 1.1 , 0.12])
Indeed, with today's cov behavior we have a 2x2 array:
cov(X,Y) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Now, when I used the word 'concatenation', I wasn't precise enough because I meant assembling X and Y in the sense of 2 vectors of observations from 2 random variables X and Y. This is achieved by concatenate(X,Y) *when properly playing with dimensions* (which I didn't mentioned) :
XY = np.concatenate((X[None, :], Y[None, :])) array([[-2.1 , -1. , 4.3 ], [ 3. , 1.1 , 0.12]])
In this context, I find stacking, np.vstack((X,Y)), more appropriate than concatenate.
In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
np.cov(XY) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Sure the resulting array is the same but whole process is totally different.
(And indeed, the actual cov Python code does use concatenate() ) Yes, but the user does not see that. Whereas you are forcing the user to do the stacking in the correct dimensions.
Now let me come back to my assertion about this behavior *usefulness*. You'll acknowledge that np.cov(XY) is made of four blocks (here just 4 simple scalars blocks).
No there are not '4' blocks just rows and columns.
Sturla showed the 4 blocks in his first message.
* diagonal blocks are just cov(X) and cov(Y) (which in this case comes to var(X) and var(Y) when setting ddof to 1) Sure but variances are still covariances.
* off diagonal blocks are symetric and are actually the covariance estimate of X, Y observations (from http://en.wikipedia.org/wiki/Covariance) Sure
that is :
((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1) -4.2860000000000005
The new proposed behaviour for cov is that cov(X,Y) would return : array(-4.2860000000000005) instead of the 2*2 matrix.
But how you interpret an 2D array where the rows are greater than 2?
Z=Y+X np.cov(np.vstack((X,Y,Z))) array([[ 11.71 , -4.286 , 7.424 ], [ -4.286 , 2.14413333, -2.14186667], [ 7.424 , -2.14186667, 5.28213333]])
* This would be in line with the cov(X,Y) mathematical definition, as well as with R behavior.
I don't care what R does because I am using Python and Python is infinitely better than R is!
But I think that is only in the 1D case.
I just checked R to make sure I remember correctly
xx = matrix((1:20)^2, nrow=4) xx [,1] [,2] [,3] [,4] [,5] [1,] 1 25 81 169 289 [2,] 4 36 100 196 324 [3,] 9 49 121 225 361 [4,] 16 64 144 256 400 cov(xx, 2*xx[,1:2]) [,1] [,2] [1,] 86.0000 219.3333 [2,] 219.3333 566.0000 [3,] 352.6667 912.6667 [4,] 486.0000 1259.3333 [5,] 619.3333 1606.0000 cov(xx) [,1] [,2] [,3] [,4] [,5] [1,] 43.0000 109.6667 176.3333 243.0000 309.6667 [2,] 109.6667 283.0000 456.3333 629.6667 803.0000 [3,] 176.3333 456.3333 736.3333 1016.3333 1296.3333 [4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667 [5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000
* This would save memory and computing resources. (and therefore help save the planet ;-) ) Nothing that you have provided shows that it will.
I don't know about saving the planet, but if X and Y have the same number of columns, we save 3 quarters of the calculations, as Sturla also explained in his first message. Josef
However, I do understand that the impact for this change may be big. This indeed requires careful reviewing.
Pierre _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Bruce _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Thu, Jan 26, 2012 at 12:45 PM, <josef.pktd@gmail.com> wrote:
On Thu, Jan 26, 2012 at 1:25 PM, Bruce Southey <bsouthey@gmail.com> wrote:
On Thu, Jan 26, 2012 at 10:07 AM, Pierre Haessig <pierre.haessig@crans.org> wrote:
Le 26/01/2012 15:57, Bruce Southey a écrit :
Can you please provide a couple of real examples with expected output that clearly show what you want?
Hi Bruce,
Thanks for your ticket feedback ! It's precisely because I see a big potential impact of the proposed change that I send first a ML message, second a ticket before jumping to a pull-request like a Sergio Leone's cowboy (sorry, I watched "for a few dollars more" last weekend...)
Now, I realize that in the ticket writing I made the wrong trade-off between conciseness and accuracy which led to some of the errors you raised. Let me try to use your example to try to share what I have in mind.
X = array([-2.1, -1. , 4.3]) Y = array([ 3. , 1.1 , 0.12])
Indeed, with today's cov behavior we have a 2x2 array:
cov(X,Y) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Now, when I used the word 'concatenation', I wasn't precise enough because I meant assembling X and Y in the sense of 2 vectors of observations from 2 random variables X and Y. This is achieved by concatenate(X,Y) *when properly playing with dimensions* (which I didn't mentioned) :
XY = np.concatenate((X[None, :], Y[None, :])) array([[-2.1 , -1. , 4.3 ], [ 3. , 1.1 , 0.12]])
In this context, I find stacking, np.vstack((X,Y)), more appropriate than concatenate.
In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
np.cov(XY) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Sure the resulting array is the same but whole process is totally different.
(And indeed, the actual cov Python code does use concatenate() ) Yes, but the user does not see that. Whereas you are forcing the user to do the stacking in the correct dimensions.
Now let me come back to my assertion about this behavior *usefulness*. You'll acknowledge that np.cov(XY) is made of four blocks (here just 4 simple scalars blocks).
No there are not '4' blocks just rows and columns.
Sturla showed the 4 blocks in his first message.
Well, I could not follow that because the code is wrong. X = np.array([-2.1, -1. , 4.3])
cX = X - X.mean(axis=0)[np.newaxis,:]
Traceback (most recent call last): File "<pyshell#6>", line 1, in <module> cX = X - X.mean(axis=0)[np.newaxis,:] IndexError: 0-d arrays can only use a single () or a list of newaxes (and a single ...) as an index whoops! Anyhow, variance-covariance matrix is symmetric but numpy or scipy lacks lapac's symmetrix matrix (http://www.netlib.org/lapack/explore-html/de/d9e/group___s_y.html)
* diagonal blocks are just cov(X) and cov(Y) (which in this case comes to var(X) and var(Y) when setting ddof to 1) Sure but variances are still covariances.
* off diagonal blocks are symetric and are actually the covariance estimate of X, Y observations (from http://en.wikipedia.org/wiki/Covariance) Sure
that is :
((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1) -4.2860000000000005
The new proposed behaviour for cov is that cov(X,Y) would return : array(-4.2860000000000005) instead of the 2*2 matrix.
But how you interpret an 2D array where the rows are greater than 2?
Z=Y+X np.cov(np.vstack((X,Y,Z))) array([[ 11.71 , -4.286 , 7.424 ], [ -4.286 , 2.14413333, -2.14186667], [ 7.424 , -2.14186667, 5.28213333]])
* This would be in line with the cov(X,Y) mathematical definition, as well as with R behavior.
I don't care what R does because I am using Python and Python is infinitely better than R is!
But I think that is only in the 1D case.
I just checked R to make sure I remember correctly
xx = matrix((1:20)^2, nrow=4) xx [,1] [,2] [,3] [,4] [,5] [1,] 1 25 81 169 289 [2,] 4 36 100 196 324 [3,] 9 49 121 225 361 [4,] 16 64 144 256 400 cov(xx, 2*xx[,1:2]) [,1] [,2] [1,] 86.0000 219.3333 [2,] 219.3333 566.0000 [3,] 352.6667 912.6667 [4,] 486.0000 1259.3333 [5,] 619.3333 1606.0000 cov(xx) [,1] [,2] [,3] [,4] [,5] [1,] 43.0000 109.6667 176.3333 243.0000 309.6667 [2,] 109.6667 283.0000 456.3333 629.6667 803.0000 [3,] 176.3333 456.3333 736.3333 1016.3333 1296.3333 [4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667 [5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000
* This would save memory and computing resources. (and therefore help save the planet ;-) ) Nothing that you have provided shows that it will.
I don't know about saving the planet, but if X and Y have the same number of columns, we save 3 quarters of the calculations, as Sturla also explained in his first message.
Can not figure those savings: For a 2 by 2 output has 3 covariances (so 3/4 =0.75 is 'needed' not 25%) a 3 by 3 output has 6 covariances a 5 by 5 output 15 covariances If you want to save memory and calculation then use symmetric storage and associated methods. Bruce
Josef
However, I do understand that the impact for this change may be big. This indeed requires careful reviewing.
Pierre _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Bruce _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Thu, Jan 26, 2012 at 3:58 PM, Bruce Southey <bsouthey@gmail.com> wrote:
On Thu, Jan 26, 2012 at 12:45 PM, <josef.pktd@gmail.com> wrote:
On Thu, Jan 26, 2012 at 1:25 PM, Bruce Southey <bsouthey@gmail.com> wrote:
On Thu, Jan 26, 2012 at 10:07 AM, Pierre Haessig <pierre.haessig@crans.org> wrote:
Le 26/01/2012 15:57, Bruce Southey a écrit :
Can you please provide a couple of real examples with expected output that clearly show what you want?
Hi Bruce,
Thanks for your ticket feedback ! It's precisely because I see a big potential impact of the proposed change that I send first a ML message, second a ticket before jumping to a pull-request like a Sergio Leone's cowboy (sorry, I watched "for a few dollars more" last weekend...)
Now, I realize that in the ticket writing I made the wrong trade-off between conciseness and accuracy which led to some of the errors you raised. Let me try to use your example to try to share what I have in mind.
> X = array([-2.1, -1. , 4.3]) > Y = array([ 3. , 1.1 , 0.12])
Indeed, with today's cov behavior we have a 2x2 array:
> cov(X,Y) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Now, when I used the word 'concatenation', I wasn't precise enough because I meant assembling X and Y in the sense of 2 vectors of observations from 2 random variables X and Y. This is achieved by concatenate(X,Y) *when properly playing with dimensions* (which I didn't mentioned) :
> XY = np.concatenate((X[None, :], Y[None, :])) array([[-2.1 , -1. , 4.3 ], [ 3. , 1.1 , 0.12]])
In this context, I find stacking, np.vstack((X,Y)), more appropriate than concatenate.
In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
> np.cov(XY) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Sure the resulting array is the same but whole process is totally different.
(And indeed, the actual cov Python code does use concatenate() ) Yes, but the user does not see that. Whereas you are forcing the user to do the stacking in the correct dimensions.
Now let me come back to my assertion about this behavior *usefulness*. You'll acknowledge that np.cov(XY) is made of four blocks (here just 4 simple scalars blocks).
No there are not '4' blocks just rows and columns.
Sturla showed the 4 blocks in his first message.
Well, I could not follow that because the code is wrong. X = np.array([-2.1, -1. , 4.3])
cX = X - X.mean(axis=0)[np.newaxis,:]
Traceback (most recent call last): File "<pyshell#6>", line 1, in <module> cX = X - X.mean(axis=0)[np.newaxis,:] IndexError: 0-d arrays can only use a single () or a list of newaxes (and a single ...) as an index whoops!
Anyhow, variance-covariance matrix is symmetric but numpy or scipy lacks lapac's symmetrix matrix (http://www.netlib.org/lapack/explore-html/de/d9e/group___s_y.html)
* diagonal blocks are just cov(X) and cov(Y) (which in this case comes to var(X) and var(Y) when setting ddof to 1) Sure but variances are still covariances.
* off diagonal blocks are symetric and are actually the covariance estimate of X, Y observations (from http://en.wikipedia.org/wiki/Covariance) Sure
that is :
> ((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1) -4.2860000000000005
The new proposed behaviour for cov is that cov(X,Y) would return : array(-4.2860000000000005) instead of the 2*2 matrix.
But how you interpret an 2D array where the rows are greater than 2?
Z=Y+X np.cov(np.vstack((X,Y,Z))) array([[ 11.71 , -4.286 , 7.424 ], [ -4.286 , 2.14413333, -2.14186667], [ 7.424 , -2.14186667, 5.28213333]])
* This would be in line with the cov(X,Y) mathematical definition, as well as with R behavior.
I don't care what R does because I am using Python and Python is infinitely better than R is!
But I think that is only in the 1D case.
I just checked R to make sure I remember correctly
xx = matrix((1:20)^2, nrow=4) xx [,1] [,2] [,3] [,4] [,5] [1,] 1 25 81 169 289 [2,] 4 36 100 196 324 [3,] 9 49 121 225 361 [4,] 16 64 144 256 400 cov(xx, 2*xx[,1:2]) [,1] [,2] [1,] 86.0000 219.3333 [2,] 219.3333 566.0000 [3,] 352.6667 912.6667 [4,] 486.0000 1259.3333 [5,] 619.3333 1606.0000 cov(xx) [,1] [,2] [,3] [,4] [,5] [1,] 43.0000 109.6667 176.3333 243.0000 309.6667 [2,] 109.6667 283.0000 456.3333 629.6667 803.0000 [3,] 176.3333 456.3333 736.3333 1016.3333 1296.3333 [4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667 [5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000
* This would save memory and computing resources. (and therefore help save the planet ;-) ) Nothing that you have provided shows that it will.
I don't know about saving the planet, but if X and Y have the same number of columns, we save 3 quarters of the calculations, as Sturla also explained in his first message.
Can not figure those savings: For a 2 by 2 output has 3 covariances (so 3/4 =0.75 is 'needed' not 25%) a 3 by 3 output has 6 covariances a 5 by 5 output 15 covariances
what numpy calculates are 4, 9 and 25 covariances, we might care only about 1, 2 and 4 of them.
If you want to save memory and calculation then use symmetric storage and associated methods.
actually for covariance matrix we stilll need to subtract means, so we won't save 75%, but we save 75% in the cross-product. suppose X and Y are (nobs, k_x) and (nobs, k_y) (means already subtracted) (and ignoring that numpy "likes" rows instead of columns) the partitioned dot product [X,Y]'[X,Y] is [[ X'X, X'Y], [Y'X, Y'Y]] X'Y is (n_x, n_y) total shape is (n_x + n_y, n_x + n_y) If we are only interested in X'Y, we don't need the other three submatrices. If n_x = 99 and n_y is 1, we save .... ? (we get a (99,1) instead of a (100, 100) matrix) and X'Y , np.dot(X, Y), doesn't have any duplicated symmetry, so exploiting symmetry is a different issue. Josef
Bruce
Josef
However, I do understand that the impact for this change may be big. This indeed requires careful reviewing.
Pierre _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Bruce _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Thu, Jan 26, 2012 at 6:43 PM, <josef.pktd@gmail.com> wrote:
On Thu, Jan 26, 2012 at 3:58 PM, Bruce Southey <bsouthey@gmail.com> wrote:
On Thu, Jan 26, 2012 at 12:45 PM, <josef.pktd@gmail.com> wrote:
On Thu, Jan 26, 2012 at 1:25 PM, Bruce Southey <bsouthey@gmail.com> wrote:
On Thu, Jan 26, 2012 at 10:07 AM, Pierre Haessig <pierre.haessig@crans.org> wrote:
Le 26/01/2012 15:57, Bruce Southey a écrit :
Can you please provide a couple of real examples with expected output that clearly show what you want?
Hi Bruce,
Thanks for your ticket feedback ! It's precisely because I see a big potential impact of the proposed change that I send first a ML message, second a ticket before jumping to a pull-request like a Sergio Leone's cowboy (sorry, I watched "for a few dollars more" last weekend...)
Now, I realize that in the ticket writing I made the wrong trade-off between conciseness and accuracy which led to some of the errors you raised. Let me try to use your example to try to share what I have in mind.
>> X = array([-2.1, -1. , 4.3]) >> Y = array([ 3. , 1.1 , 0.12])
Indeed, with today's cov behavior we have a 2x2 array:
>> cov(X,Y) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Now, when I used the word 'concatenation', I wasn't precise enough because I meant assembling X and Y in the sense of 2 vectors of observations from 2 random variables X and Y. This is achieved by concatenate(X,Y) *when properly playing with dimensions* (which I didn't mentioned) :
>> XY = np.concatenate((X[None, :], Y[None, :])) array([[-2.1 , -1. , 4.3 ], [ 3. , 1.1 , 0.12]])
In this context, I find stacking, np.vstack((X,Y)), more appropriate than concatenate.
In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
>> np.cov(XY) array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Sure the resulting array is the same but whole process is totally different.
(And indeed, the actual cov Python code does use concatenate() ) Yes, but the user does not see that. Whereas you are forcing the user to do the stacking in the correct dimensions.
Now let me come back to my assertion about this behavior *usefulness*. You'll acknowledge that np.cov(XY) is made of four blocks (here just 4 simple scalars blocks).
No there are not '4' blocks just rows and columns.
Sturla showed the 4 blocks in his first message.
Well, I could not follow that because the code is wrong. X = np.array([-2.1, -1. , 4.3])
cX = X - X.mean(axis=0)[np.newaxis,:]
Traceback (most recent call last): File "<pyshell#6>", line 1, in <module> cX = X - X.mean(axis=0)[np.newaxis,:] IndexError: 0-d arrays can only use a single () or a list of newaxes (and a single ...) as an index whoops!
Anyhow, variance-covariance matrix is symmetric but numpy or scipy lacks lapac's symmetrix matrix (http://www.netlib.org/lapack/explore-html/de/d9e/group___s_y.html)
* diagonal blocks are just cov(X) and cov(Y) (which in this case comes to var(X) and var(Y) when setting ddof to 1) Sure but variances are still covariances.
* off diagonal blocks are symetric and are actually the covariance estimate of X, Y observations (from http://en.wikipedia.org/wiki/Covariance) Sure
that is :
>> ((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1) -4.2860000000000005
The new proposed behaviour for cov is that cov(X,Y) would return : array(-4.2860000000000005) instead of the 2*2 matrix.
But how you interpret an 2D array where the rows are greater than 2?
> Z=Y+X > np.cov(np.vstack((X,Y,Z))) array([[ 11.71 , -4.286 , 7.424 ], [ -4.286 , 2.14413333, -2.14186667], [ 7.424 , -2.14186667, 5.28213333]])
* This would be in line with the cov(X,Y) mathematical definition, as well as with R behavior.
I don't care what R does because I am using Python and Python is infinitely better than R is!
But I think that is only in the 1D case.
I just checked R to make sure I remember correctly
xx = matrix((1:20)^2, nrow=4) xx [,1] [,2] [,3] [,4] [,5] [1,] 1 25 81 169 289 [2,] 4 36 100 196 324 [3,] 9 49 121 225 361 [4,] 16 64 144 256 400 cov(xx, 2*xx[,1:2]) [,1] [,2] [1,] 86.0000 219.3333 [2,] 219.3333 566.0000 [3,] 352.6667 912.6667 [4,] 486.0000 1259.3333 [5,] 619.3333 1606.0000 cov(xx) [,1] [,2] [,3] [,4] [,5] [1,] 43.0000 109.6667 176.3333 243.0000 309.6667 [2,] 109.6667 283.0000 456.3333 629.6667 803.0000 [3,] 176.3333 456.3333 736.3333 1016.3333 1296.3333 [4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667 [5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000
* This would save memory and computing resources. (and therefore help save the planet ;-) ) Nothing that you have provided shows that it will.
I don't know about saving the planet, but if X and Y have the same number of columns, we save 3 quarters of the calculations, as Sturla also explained in his first message.
Can not figure those savings: For a 2 by 2 output has 3 covariances (so 3/4 =0.75 is 'needed' not 25%) a 3 by 3 output has 6 covariances a 5 by 5 output 15 covariances
what numpy calculates are 4, 9 and 25 covariances, we might care only about 1, 2 and 4 of them.
If you want to save memory and calculation then use symmetric storage and associated methods.
actually for covariance matrix we stilll need to subtract means, so we won't save 75%, but we save 75% in the cross-product.
suppose X and Y are (nobs, k_x) and (nobs, k_y) (means already subtracted) (and ignoring that numpy "likes" rows instead of columns)
the partitioned dot product [X,Y]'[X,Y] is
[[ X'X, X'Y], [Y'X, Y'Y]]
X'Y is (n_x, n_y) total shape is (n_x + n_y, n_x + n_y)
If we are only interested in X'Y, we don't need the other three submatrices.
If n_x = 99 and n_y is 1, we save .... ? (we get a (99,1) instead of a (100, 100) matrix)
and X'Y , np.dot(X, Y), doesn't have any duplicated symmetry, so exploiting symmetry is a different issue.
Josef
Bruce
Josef
However, I do understand that the impact for this change may be big. This indeed requires careful reviewing.
Pierre _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Bruce _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Thanks for someone to clearly state what they want. But still lacks evidence that it will save the world - when nobs is large, n_x and n_y are meaningless and thus (99,1) vs (100, 100) is also meaningless. Further dealing separately with the two arrays also bring additional overhead - small not zero. Bruce
participants (8)
-
Benjamin Root -
Bruce Southey -
Elliot Saba -
John Salvatier -
josef.pktd@gmail.com -
Pauli Virtanen -
Pierre Haessig -
Sturla Molden