On 26/01/2012 15:57, Bruce Southey wrote:
Can you please provide a couple of real examples with expected output that clearly show what you want?
Hi Bruce,

Thanks for your feedback on the ticket! It is precisely because I see a big potential impact in the proposed change that I first sent a message to the mailing list, then opened a ticket, rather than jumping straight to a pull request like a Sergio Leone cowboy (sorry, I watched "For a Few Dollars More" last weekend...).

Now I realize that, in writing the ticket, I made the wrong trade-off between conciseness and accuracy, which led to some of the errors you raised. Let me use your example to share what I have in mind.
X = array([-2.1, -1. , 4.3])
Y = array([ 3. , 1.1 , 0.12])
Indeed, with today's cov behavior we have a 2x2 array:
cov(X,Y)
array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
Now, when I used the word 'concatenation', I wasn't precise enough: I meant assembling X and Y as 2 vectors of observations from 2 random variables X and Y. This is achieved with concatenate() on X and Y *when properly playing with dimensions* (which I didn't mention):
XY = np.concatenate((X[None, :], Y[None, :]))
array([[-2.1 , -1. , 4.3 ], [ 3. , 1.1 , 0.12]])
In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
np.cov(XY)
array([[ 11.71 , -4.286 ], [ -4.286 , 2.14413333]])
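For anyone who wants to reproduce the equivalence stated above, here is a minimal sketch. It assumes NumPy is imported as np; the variable name `same` is only illustrative and is not part of the original session.

```python
import numpy as np

X = np.array([-2.1, -1.0, 4.3])
Y = np.array([3.0, 1.1, 0.12])

# Stack the two observation vectors as rows: one row per variable,
# one column per observation.
XY = np.concatenate((X[None, :], Y[None, :]))

# With today's behavior, both calls should give the same 2x2 matrix.
same = np.allclose(np.cov(X, Y), np.cov(XY))
print(XY.shape, same)
```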
(And indeed, the actual cov Python code does use concatenate().)

Now let me come back to my assertion about the *usefulness* of this behavior. You'll acknowledge that np.cov(XY) is made of four blocks (here just four simple scalar blocks).
* The diagonal blocks are just cov(X) and cov(Y) (which in this case come down to var(X) and var(Y) when setting ddof to 1).
* The off-diagonal blocks are symmetric and are actually the covariance estimate of the X, Y observations (from http://en.wikipedia.org/wiki/Covariance), that is:
((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1)
-4.2860000000000005
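The block structure described above can be checked with a short sketch. It assumes NumPy is imported as np; the names `C`, `cross`, `diag_ok` and `offdiag_ok` are illustrative only.

```python
import numpy as np

X = np.array([-2.1, -1.0, 4.3])
Y = np.array([3.0, 1.1, 0.12])

# Today's behavior: a 2x2 matrix.
C = np.cov(X, Y)

# Diagonal blocks: the sample variances of X and Y (ddof=1).
diag_ok = np.allclose([C[0, 0], C[1, 1]], [X.var(ddof=1), Y.var(ddof=1)])

# Off-diagonal blocks: the sample covariance between X and Y,
# computed by hand with the same n-1 normalization.
n = X.size
cross = ((X - X.mean()) * (Y - Y.mean())).sum() / (n - 1)
offdiag_ok = np.allclose([C[0, 1], C[1, 0]], cross)

print(diag_ok, offdiag_ok, cross)
```

Only `cross` carries information that is not already available from var(X) and var(Y), which is the point made about the off-diagonal blocks.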
The newly proposed behaviour for cov is that cov(X,Y) would return:
array(-4.2860000000000005)
instead of the 2x2 matrix.
* This would be in line with the mathematical definition of cov(X,Y), as well as with R's behavior.
* This would save memory and computing resources (and therefore help save the planet ;-) ).

However, I do understand that the impact of this change may be big. It indeed requires careful review.

Pierre