[Numpy-discussion] Cross-covariance function

Bruce Southey bsouthey at gmail.com
Thu Jan 26 21:45:49 EST 2012


On Thu, Jan 26, 2012 at 6:43 PM,  <josef.pktd at gmail.com> wrote:
> On Thu, Jan 26, 2012 at 3:58 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>> On Thu, Jan 26, 2012 at 12:45 PM,  <josef.pktd at gmail.com> wrote:
>>> On Thu, Jan 26, 2012 at 1:25 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>>>> On Thu, Jan 26, 2012 at 10:07 AM, Pierre Haessig
>>>> <pierre.haessig at crans.org> wrote:
>>>>> On 26/01/2012 15:57, Bruce Southey wrote:
>>>>>> Can you please provide a
>>>>>> couple of real examples with expected output that clearly show what
>>>>>> you want?
>>>>>>
>>>>> Hi Bruce,
>>>>>
>>>>> Thanks for your ticket feedback! It's precisely because I see a big
>>>>> potential impact of the proposed change that I sent a mailing-list
>>>>> message first and a ticket second, before jumping to a pull request
>>>>> like a Sergio Leone cowboy (sorry, I watched "For a Few Dollars More"
>>>>> last weekend...)
>>>>>
>>>>> Now, I realize that when writing the ticket I made the wrong trade-off
>>>>> between conciseness and accuracy, which led to some of the errors you
>>>>> raised. Let me use your example to share what I have in mind.
>>>>>
>>>>> >>> X = array([-2.1, -1. ,  4.3])
>>>>> >>> Y = array([ 3.  ,  1.1 ,  0.12])
>>>>>
>>>>> Indeed, with today's cov behavior we have a 2x2 array:
>>>>> >>> cov(X,Y)
>>>>> array([[ 11.71      ,  -4.286     ],
>>>>>        [ -4.286     ,   2.14413333]])
>>>>>
>>>>> Now, when I used the word 'concatenation', I wasn't precise enough:
>>>>> I meant assembling X and Y in the sense of 2 vectors of observations
>>>>> from 2 random variables X and Y.
>>>>> This is achieved by concatenate(X, Y) *when properly playing with
>>>>> dimensions* (which I didn't mention):
>>>>> >>> XY = np.concatenate((X[None, :], Y[None, :]))
>>>>> >>> XY
>>>>> array([[-2.1 , -1.  ,  4.3 ],
>>>>>        [ 3.  ,  1.1 ,  0.12]])
>>>>
>>>> In this context, I find stacking,  np.vstack((X,Y)), more appropriate
>>>> than concatenate.
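>>>>
>>>> For example, with the X and Y above this gives the same 2x3 array in
>>>> one step:
>>>>
>>>> >>> np.vstack((X, Y))
>>>> array([[-2.1 , -1.  ,  4.3 ],
>>>>        [ 3.  ,  1.1 ,  0.12]])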
>>>>
>>>>>
>>>>> In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
>>>>> >>> np.cov(XY)
>>>>> array([[ 11.71      ,  -4.286     ],
>>>>>        [ -4.286     ,   2.14413333]])
>>>>>
>>>> Sure, the resulting array is the same, but the whole process is totally different.
>>>>
>>>>
>>>>> (And indeed, the actual cov Python code does use concatenate() )
>>>> Yes, but the user does not see that, whereas you are forcing the user
>>>> to do the stacking with the correct dimensions.
>>>>
>>>>
>>>>>
>>>>>
>>>>> Now let me come back to my assertion about this behavior's *usefulness*.
>>>>> You'll acknowledge that np.cov(XY) is made of four blocks (here just 4
>>>>> simple scalar blocks).
>>>> No, there are not '4' blocks, just rows and columns.
>>>
>>> Sturla showed the 4 blocks in his first message.
>>>
>> Well, I could not follow that because the code is wrong.
>> X = np.array([-2.1, -1. ,  4.3])
>> >>> cX = X - X.mean(axis=0)[np.newaxis,:]
>>
>> Traceback (most recent call last):
>>  File "<pyshell#6>", line 1, in <module>
>>    cX = X - X.mean(axis=0)[np.newaxis,:]
>> IndexError: 0-d arrays can only use a single () or a list of newaxes
>> (and a single ...) as an index
>>  whoops!
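>>
>> For the record, that line only makes sense for 2-D input; for a 1-D
>> array the centering is simply:
>>
>> >>> cX = X - X.mean()
>> >>> cX
>> array([-2.5, -1.4,  3.9])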
>>
>> Anyhow, the variance-covariance matrix is symmetric, but numpy and scipy
>> lack LAPACK's symmetric matrix storage
>> (http://www.netlib.org/lapack/explore-html/de/d9e/group___s_y.html)
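>>
>> Roughly what I mean - a sketch, assuming your scipy exposes the BLAS
>> syrk routine as scipy.linalg.blas.dsyrk (check where your version puts
>> it):
>>
>> import numpy as np
>> from scipy.linalg import blas
>>
>> XY = np.vstack((X - X.mean(), Y - Y.mean()))  # centered, variables in rows
>> # syrk forms XY * XY' but fills only the requested triangle, which is
>> # exactly the symmetric-storage saving I mean; the strict upper
>> # triangle of C is left at zero
>> C = blas.dsyrk(1.0 / (XY.shape[1] - 1), XY, lower=1)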
>>
>>>>
>>>>>  * diagonal blocks are just cov(X) and cov(Y) (which in this case come
>>>>> down to var(X) and var(Y) when setting ddof to 1)
>>>> Sure, but variances are still covariances.
>>>>
>>>>>  * off-diagonal blocks are symmetric and are actually the covariance
>>>>> estimate of X, Y observations (from
>>>>> http://en.wikipedia.org/wiki/Covariance)
>>>> Sure
>>>>>
>>>>> that is:
>>>>> >>> ((X - X.mean()) * (Y - Y.mean())).sum() / (3 - 1)
>>>>> -4.2860000000000005
>>>>>
>>>>> The newly proposed behaviour for cov is that cov(X, Y) would return
>>>>> array(-4.2860000000000005) instead of the 2x2 matrix.
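>>>>>
>>>>> In code, the proposal amounts to something like this little helper
>>>>> (just a sketch of the 1-D case to fix ideas, not the actual patch):
>>>>>
>>>>> def crosscov(x, y, ddof=1):
>>>>>     # only the off-diagonal block of today's cov(x, y)
>>>>>     x = np.asarray(x) - np.mean(x)
>>>>>     y = np.asarray(y) - np.mean(y)
>>>>>     return np.dot(x, y) / (x.size - ddof)
>>>>>
>>>>> >>> crosscov(X, Y)   # same -4.286 as the off-diagonal block above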
>>>>
>>>> But how do you interpret a 2D array with more than 2 rows?
>>>> >>> Z = Y + X
>>>> >>> np.cov(np.vstack((X, Y, Z)))
>>>> array([[ 11.71      ,  -4.286     ,   7.424     ],
>>>>       [ -4.286     ,   2.14413333,  -2.14186667],
>>>>       [  7.424     ,  -2.14186667,   5.28213333]])
>>>>
>>>>
>>>>>
>>>>>  * This would be in line with the cov(X,Y) mathematical definition, as
>>>>> well as with R behavior.
>>>> I don't care what R does because I am using Python, and Python is
>>>> infinitely better than R!
>>>>
>>>> But I think that is only in the 1D case.
>>>
>>> I just checked R to make sure I remember correctly
>>>
>>> > xx = matrix((1:20)^2, nrow=4)
>>> > xx
>>>     [,1] [,2] [,3] [,4] [,5]
>>> [1,]    1   25   81  169  289
>>> [2,]    4   36  100  196  324
>>> [3,]    9   49  121  225  361
>>> [4,]   16   64  144  256  400
>>> > cov(xx, 2*xx[,1:2])
>>>         [,1]      [,2]
>>> [1,]  86.0000  219.3333
>>> [2,] 219.3333  566.0000
>>> [3,] 352.6667  912.6667
>>> [4,] 486.0000 1259.3333
>>> [5,] 619.3333 1606.0000
>>> > cov(xx)
>>>         [,1]     [,2]      [,3]      [,4]      [,5]
>>> [1,]  43.0000 109.6667  176.3333  243.0000  309.6667
>>> [2,] 109.6667 283.0000  456.3333  629.6667  803.0000
>>> [3,] 176.3333 456.3333  736.3333 1016.3333 1296.3333
>>> [4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667
>>> [5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000
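>>>
>>> With today's numpy you would get that first 5x2 block by computing the
>>> full 7x7 matrix and slicing, something like this (np.cov wants
>>> variables in rows, hence the transposes):
>>>
>>> >>> xx = (np.arange(1, 21)**2).reshape(5, 4).T   # same data as R's xx
>>> >>> big = np.cov(np.vstack((xx.T, 2 * xx[:, :2].T)))   # full (7, 7)
>>> >>> big[:5, 5:]   # the (5, 2) cross block that R returns directly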
>>>
>>>
>>>>
>>>>>  * This would save memory and computing resources. (and therefore help
>>>>> save the planet ;-) )
>>>> Nothing that you have provided shows that it will.
>>>
>>> I don't know about saving the planet, but if X and Y have the same
>>> number of columns, we save 3 quarters of the calculations, as Sturla
>>> also explained in his first message.
>>>
>> Cannot figure those savings:
>> a 2 by 2 output has 3 distinct covariances (so 3/4 = 0.75 is 'needed', not 25%),
>> a 3 by 3 output has 6 distinct covariances,
>> a 5 by 5 output has 15 distinct covariances.
>
> what numpy calculates is 4, 9 and 25 covariances; we might care about
> only 1, 2 and 4 of them.
>
>>
>> If you want to save memory and calculation then use symmetric storage
>> and associated methods.
>
> actually for the covariance matrix we still need to subtract means, so we
> won't save 75% overall, but we save 75% in the cross-product.
>
> suppose X and Y are (nobs, k_x) and (nobs, k_y)   (means already subtracted)
> (and ignoring that numpy "likes" rows instead of columns)
>
> the partitioned dot product  [X,Y]'[X,Y] is
>
> [[ X'X, X'Y],
>  [Y'X, Y'Y]]
>
> X'Y is (k_x, k_y)
> total shape is (k_x + k_y, k_x + k_y)
>
> If we are only interested in X'Y, we don't need the other three submatrices.
>
> If k_x = 99 and k_y = 1, we save .... ?
> (we get a (99, 1) instead of a (100, 100) matrix)
>
> and X'Y, np.dot(X.T, Y), doesn't have any duplicated symmetry, so
> exploiting symmetry is a different issue.
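>
> a concrete sketch of that saving (shapes made up for illustration):
>
> import numpy as np
>
> nobs, k_x, k_y = 1000, 99, 1
> X = np.random.randn(nobs, k_x)
> Y = np.random.randn(nobs, k_y)
> Xc = X - X.mean(axis=0)
> Yc = Y - Y.mean(axis=0)
>
> # all we actually want: the (99, 1) cross block
> cross = np.dot(Xc.T, Yc) / (nobs - 1)
>
> # what we are forced to compute today: the full (100, 100) matrix
> full = np.cov(np.hstack((X, Y)), rowvar=0)
> # cross == full[:k_x, k_x:] up to floating-point noise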
>
> Josef
>
>>
>> Bruce
>>
>>> Josef
>>>
>>>>
>>>>>
>>>>> However, I do understand that the impact for this change may be big.
>>>>> This indeed requires careful reviewing.
>>>>>
>>>>> Pierre
>>>>
>>>> Bruce

Thanks for finally stating clearly what is wanted.
But this still lacks evidence that it will save the world - when nobs is
large, k_x and k_y are negligible by comparison, and thus (99, 1) vs
(100, 100) is also negligible.
Further, dealing separately with the two arrays also brings additional
overhead - small, but not zero.


Bruce


