[SciPy-User] Strange behaviour from corrcoef when calculating correlation-matrix in SciPy/NumPy.

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Mar 3 19:07:25 EST 2011


On Thu, Mar 3, 2011 at 6:34 PM, Raj <rajeev.raizada at gmail.com> wrote:
> On Mar 3, 3:18 pm, eat <e.antero.ta... at gmail.com> wrote:
>> So perhaps there does not exist any really simple and straightforward
>> translation
>> (of corrcoef) from matlab to numpy? Just as an example; how would you
>> implement case %(3 properly  with numpy?
>> Regards,
>> eat
>
> It turns out that Matlab itself is inconsistent on this front:
> it has two different functions for computing correlation!
>
> One is corr(), which is in the Matlab Stats Toolbox.
> This is the one that I have always used,
> and it is better-behaved, in my opinion,
> as I argue below.
>
> The other Matlab function is corrcoef().
> This one is not in the Stats Toolbox; it's in the main code base.
> I didn't even know that this function existed until this thread!  :-)
>
> In my view, the Matlab function corr() is the one to emulate.
> It has the very desirable property that corr(m,m) and corr(m)
> are the same.
>
> Also, its behaviour when correlating two different matrices
> is very reasonable:
> http://www.mathworks.com/help/toolbox/stats/corr.html
> RHO = corr(X,Y) returns a p1-by-p2 matrix containing the pairwise
> correlation coefficient between each pair of columns in the n-by-p1
> and n-by-p2 matrices X and Y.
>
>>> m1 = [ 1 2; -1 3; 0 4]
> m1 =
>     1     2
>    -1     3
>     0     4
>
>>> corr(m1)
> ans =
>    1.0000   -0.5000
>   -0.5000    1.0000
>
>>> corr(m1,m1)
> ans =
>    1.0000   -0.5000
>   -0.5000    1.0000
>
>>> m2 = [ -1 1; 2 -1; -1 3]
> m2 =
>    -1     1
>     2    -1
>    -1     3
>
>>> corr(m1,m2)
> ans =
>   -0.8660    0.5000
>         0    0.5000
>
> In contrast, the Matlab corrcoef() does weird things,
> and is almost as bad as the SciPy corrcoef() function in that regard.
>
>>> corrcoef(m1)
> ans =
>    1.0000   -0.5000
>   -0.5000    1.0000
>
>>> corrcoef(m1,m1)
> ans =
>     1     1
>     1     1
>
>>> corrcoef(m1,m2)
> ans =
>    1.0000    0.2125
>    0.2125    1.0000
>
> So, if anything in Matlab is to be taken as a role-model,
> I would advocate for the Stats Toolbox function corr().
>
> Another argument for this corr() behavior is that
> the R function cor() behaves the same way.
> I guess R is the gold-standard for stats computing.
>
> Here are the above operations in R:
>
>> m1 <- matrix(c(1, -1, 0, 2, 3, 4),nrow=3)
>> m1
>     [,1] [,2]
> [1,]    1    2
> [2,]   -1    3
> [3,]    0    4
>
>> m2 <- matrix(c(-1, 2, -1, 1, -1, 3),nrow=3)
>> m2
>     [,1] [,2]
> [1,]   -1    1
> [2,]    2   -1
> [3,]   -1    3
>
>> cor(m1)
>     [,1] [,2]
> [1,]  1.0 -0.5
> [2,] -0.5  1.0
>
>> cor(m1,m1)
>     [,1] [,2]
> [1,]  1.0 -0.5
> [2,] -0.5  1.0
>
>> cor(m1,m2)
>           [,1] [,2]
> [1,] -0.8660254  0.5
> [2,]  0.0000000  0.5
>
> In summary, let's copy R's cor() and Matlab's corr(),
> not Matlab's corrcoef().

That's the difference between a stats function and the generic
numpy/matlab one. Since corrcoef lives in numpy, it also follows numpy
conventions such as rowvar (variables in rows by default), which
always throws me off.
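
To make the rowvar point concrete, here is a small sketch using Raj's
m1 and m2 from above. The variable names are only for illustration.
By default np.corrcoef treats rows as variables, and feeding it the two
matrices flattened reproduces the 0.2125 that Matlab's corrcoef(m1,m2)
printed, which suggests Matlab correlates the matrices as flattened
vectors there.

```python
import numpy as np

m1 = np.array([[1, 2], [-1, 3], [0, 4]])
m2 = np.array([[-1, 1], [2, -1], [-1, 3]])

# Default rowvar=True: each ROW is a variable, so 3 rows give a 3x3 result.
by_rows = np.corrcoef(m1)

# rowvar=False: each COLUMN is a variable, matching R's cor(m1).
by_cols = np.corrcoef(m1, rowvar=False)

# Correlating the flattened matrices gives one coefficient,
# approximately 0.2125 as in the Matlab corrcoef(m1,m2) output.
flat = np.corrcoef(m1.ravel(), m2.ravel())[0, 1]

print(by_rows.shape, by_cols.shape, round(flat, 4))
```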

>>> import numpy as np
>>> from scipy import stats
>>> x = np.random.randn(10,3)
>>> y = np.random.randn(10,2)
>>> # standardize each column, then average the cross products
>>> xs = stats.zscore(x)
>>> ys = stats.zscore(y)
>>> np.dot(xs.T, ys)/xs.shape[0]
array([[ 0.44258451,  0.42834949],
       [-0.22926899,  0.41053462],
       [-0.03316133,  0.1747719 ]])
>>> # same numbers: the x-versus-y block of the stacked 5x5 matrix
>>> np.corrcoef(x,y, rowvar=False)[:3, -2:]
array([[ 0.44258451,  0.42834949],
       [-0.22926899,  0.41053462],
       [-0.03316133,  0.1747719 ]])
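
Along the same lines, a minimal sketch of a column-wise corr(x, y) in
the spirit of R's cor() and Matlab's corr(). The helper name corr and
its signature are my own for illustration, not an existing numpy or
scipy function. It assumes columns are variables and that both inputs
have the same number of rows; Raj's m1 and m2 serve as a check.

```python
import numpy as np

def corr(x, y=None):
    """Pairwise Pearson correlation between columns of x and columns of y.

    Hypothetical helper: returns a p1-by-p2 array for n-by-p1 x and
    n-by-p2 y. With y omitted, correlates x with itself.
    """
    x = np.asarray(x, dtype=float)
    y = x if y is None else np.asarray(y, dtype=float)
    # Center each column.
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Cross products divided by the product of column norms.
    cross = xc.T @ yc
    x_norm = np.sqrt((xc ** 2).sum(axis=0))
    y_norm = np.sqrt((yc ** 2).sum(axis=0))
    return cross / np.outer(x_norm, y_norm)

m1 = np.array([[1, 2], [-1, 3], [0, 4]])
m2 = np.array([[-1, 1], [2, -1], [-1, 3]])

print(corr(m1))       # same as corr(m1, m1)
print(corr(m1, m2))   # 2x2 cross-correlation block
```

A constant column would give a zero norm and hence nan in the result;
the sketch does not guard against that.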

Josef


>
> Raj