[SciPy-User] Strange behaviour from corrcoef when calculating correlation-matrix in SciPy/NumPy.
josef.pktd at gmail.com
Thu Mar 3 19:07:25 EST 2011
On Thu, Mar 3, 2011 at 6:34 PM, Raj <rajeev.raizada at gmail.com> wrote:
> On Mar 3, 3:18 pm, eat <e.antero.ta... at gmail.com> wrote:
>> So perhaps there does not exist any really simple and straightforward
>> translation
>> (of corrcoef) from matlab to numpy? Just as an example; how would you
>> implement case %(3 properly with numpy?
>> Regards,
>> eat
>
> It turns out that Matlab embodies some confusion of its own
> on this front: it has two different functions
> for computing correlation!
>
> One is corr(), which is in the Matlab Stats Toolbox.
> This is the one that I have always used,
> and it is better-behaved, in my opinion,
> as I argue below.
>
> The other Matlab function is corrcoef().
> This is not the Stats Toolbox function, it's in the main code base.
> I didn't even know that this function existed until this thread! :-)
>
> In my view, the Matlab function corr() is the one to emulate.
> It has the very desirable property that corr(m,m) and corr(m)
> are the same.
>
> Also, its behaviour when correlating two different matrices
> is very reasonable:
> http://www.mathworks.com/help/toolbox/stats/corr.html
> RHO = corr(X,Y) returns a p1-by-p2 matrix containing the pairwise
> correlation coefficient between each pair of columns in the n-by-p1
> and n-by-p2 matrices X and Y.
>
> >> m1 = [ 1 2; -1 3; 0 4]
> m1 =
>      1     2
>     -1     3
>      0     4
>
> >> corr(m1)
> ans =
>     1.0000   -0.5000
>    -0.5000    1.0000
>
> >> corr(m1,m1)
> ans =
>     1.0000   -0.5000
>    -0.5000    1.0000
>
> >> m2 = [ -1 1; 2 -1; -1 3]
> m2 =
>     -1     1
>      2    -1
>     -1     3
>
> >> corr(m1,m2)
> ans =
>    -0.8660    0.5000
>          0    0.5000
>
> In contrast, the Matlab corrcoef() does weird things,
> and is almost as bad as the SciPy corrcoef() function in that regard.
>
> >> corrcoef(m1)
> ans =
>     1.0000   -0.5000
>    -0.5000    1.0000
>
> >> corrcoef(m1,m1)
> ans =
>      1     1
>      1     1
>
> >> corrcoef(m1,m2)
> ans =
>     1.0000    0.2125
>     0.2125    1.0000
>
> So, if anything in Matlab is to be taken as a role-model,
> I would advocate for the Stats Toolbox function corr().
>
> Another argument for this corr() behavior is that
> the R function cor() behaves the same way.
> I guess R is the gold-standard for stats computing.
>
> Here are the above operations in R:
>
> > m1 <- matrix(c(1, -1, 0, 2, 3, 4),nrow=3)
> > m1
>      [,1] [,2]
> [1,]    1    2
> [2,]   -1    3
> [3,]    0    4
>
> > m2 <- matrix(c(-1, 2, -1, 1, -1, 3),nrow=3)
> > m2
>      [,1] [,2]
> [1,]   -1    1
> [2,]    2   -1
> [3,]   -1    3
>
> > cor(m1)
>      [,1] [,2]
> [1,]  1.0 -0.5
> [2,] -0.5  1.0
>
> > cor(m1,m1)
>      [,1] [,2]
> [1,]  1.0 -0.5
> [2,] -0.5  1.0
>
> > cor(m1,m2)
>            [,1] [,2]
> [1,] -0.8660254  0.5
> [2,]  0.0000000  0.5
>
> In summary, let's copy R's cor() and Matlab's corr(),
> not Matlab's corrcoef().
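That's the difference between a stats function and a generic numpy/matlab
one. Since corrcoef lives in numpy, it also follows numpy conventions such
as rowvar, which always throws me off. The cross-correlation block can be
computed directly from z-scores, or sliced out of the corrcoef result: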
>>> import numpy as np
>>> from scipy import stats
>>> x = np.random.randn(10,3)
>>> y = np.random.randn(10,2)
>>> xs = stats.zscore(x)   # standardize each column (ddof=0)
>>> ys = stats.zscore(y)
>>> np.dot(xs.T, ys)/xs.shape[0]   # 3x2 block of x-vs-y correlations
array([[ 0.44258451,  0.42834949],
       [-0.22926899,  0.41053462],
       [-0.03316133,  0.1747719 ]])
>>> np.corrcoef(x,y, rowvar=0)[:3, -2:]   # same block, cut from the 5x5 matrix
array([[ 0.44258451,  0.42834949],
       [-0.22926899,  0.41053462],
       [-0.03316133,  0.1747719 ]])
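
For the m1 and m2 from the quoted message, the same idea can be wrapped in a
small helper. This is only a sketch: the name corr() and its signature are
made up here to mimic R's cor() and Matlab's corr(), and are not an existing
numpy or scipy function. The last line assumes that Matlab's corrcoef(m1,m2)
treats both matrices as flattened column-major vectors, which is consistent
with the 0.2125 shown above.

```python
import numpy as np

def corr(x, y=None):
    """Column-wise Pearson correlation (hypothetical helper, not a numpy API).

    Returns a p1-by-p2 array for an n-by-p1 x and an n-by-p2 y.
    With y omitted, x is correlated with itself.
    """
    x = np.asarray(x, dtype=float)
    y = x if y is None else np.asarray(y, dtype=float)
    # Standardize each column with the population std (ddof=0),
    # which is also what stats.zscore does by default.
    xs = (x - x.mean(axis=0)) / x.std(axis=0)
    ys = (y - y.mean(axis=0)) / y.std(axis=0)
    return np.dot(xs.T, ys) / x.shape[0]

m1 = np.array([[1, 2], [-1, 3], [0, 4]])
m2 = np.array([[-1, 1], [2, -1], [-1, 3]])

r_self = corr(m1)        # same as corr(m1, m1)
r_cross = corr(m1, m2)   # 2x2 block of m1-columns vs m2-columns

# np.corrcoef stacks the columns of m1 and m2 into four variables,
# so the cross block is the upper-right 2x2 corner of the 4x4 result.
r_np = np.corrcoef(m1, m2, rowvar=False)[:2, 2:]

# Assumption: Matlab's corrcoef(m1, m2) correlates m1(:) with m2(:).
r_flat = np.corrcoef(m1.ravel(order="F"), m2.ravel(order="F"))[0, 1]
```

Here r_cross and r_np agree with the corr(m1,m2) / cor(m1,m2) output quoted
above, and r_flat comes out at about 0.2125.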
Josef
>
> Raj