Identifying Colinear Columns of a Matrix
Hello All,

I am trying to identify columns of a matrix that are perfectly collinear. It is not that difficult to identify when two columns are identical or have zero variance, but I do not know how to ID when the culprit is of a higher order, i.e. columns 1 + 2 + 3 = column 4. NUM.corrcoef(matrix.T) will return NaNs when the matrix is singular, and LA.cond(matrix.T) will provide a very large condition number, but they do not tell me which columns are causing the problem. For example:

zt = numpy.array([[ 1.  ,  1.  ,  1.  ,  1.  ,  1.  ],
                  [ 0.25,  0.1 ,  0.2 ,  0.25,  0.5 ],
                  [ 0.75,  0.9 ,  0.8 ,  0.75,  0.5 ],
                  [ 3.  ,  8.  ,  0.  ,  5.  ,  0.  ]])

How can I identify that columns 0, 1, 2 are the issue, because column 1 + column 2 = column 0?

Any input would be greatly appreciated. Thanks much,

MJ
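A minimal reproduction of the problem (not from the original message, just an illustration): numpy's matrix_rank confirms that zt is rank-deficient, but, like cond(), it does not say which rows are responsible.

```python
import numpy as np

# The transposed design matrix from the question: each row is one
# column of the original design matrix.
zt = np.array([[1.  , 1. , 1. , 1.  , 1. ],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.  , 8. , 0. , 5.  , 0. ]])

# Only 3 of the 4 rows are linearly independent (row 0 = row 1 + row 2),
# but the rank alone does not localize the dependency.
print(np.linalg.matrix_rank(zt))  # 3
```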
As you will note, since most of the functions work on rows, the matrix in question has been transposed: each row of zt is one column of the original design matrix.

MJ
On Fri, Aug 26, 2011 at 1:10 PM, Mark Janikas wrote:
<snip>
The way that I know to do this in a regression context, for (near-perfect) multicollinearity, is the VIF. It's long been on my todo list for statsmodels. http://en.wikipedia.org/wiki/Variance_inflation_factor

Maybe there are other ways with decompositions; I'd be happy to hear about them. Please post back if you write any code to do this.

Skipper
I actually use the VIF when the design matrix can be inverted. I do it the quick-and-dirty way, as opposed to the stepwise regression:

1. Calc the correlation matrix of the design (w/o the intercept)
2. Return the diagonal of the inverse of the correlation matrix from step 1.

Again, the problem lies in the multiple-column relationship: I wouldn't be able to run the sub-regressions at all when the columns are perfectly collinear.
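The two-step VIF shortcut described above can be sketched as follows (the data and the function name `quick_vif` are made up for illustration; this works only while the correlation matrix is still invertible):

```python
import numpy as np

def quick_vif(x):
    """Quick-and-dirty VIFs: the diagonal of the inverse correlation matrix.

    x: observations in rows, variables in columns (no intercept column).
    Fails once columns are *perfectly* collinear, because the correlation
    matrix is then singular and cannot be inverted.
    """
    r = np.corrcoef(x, rowvar=False)   # step 1: correlation matrix
    return np.diag(np.linalg.inv(r))   # step 2: diagonal of its inverse

# Made-up data: the 4th variable is nearly a copy of the 1st.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
x = np.column_stack([x, x[:, 0] + 0.01 * rng.normal(size=100)])

vif = quick_vif(x)
print(vif)  # entries 0 and 3 are very large; entries 1 and 2 are near 1
```

Each diagonal entry equals 1 / (1 - R_j^2) for the regression of variable j on the others, which is the usual VIF definition.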
MJ
-----Original Message-----
From: Skipper Seabold
Sent: Friday, August 26, 2011 10:28 AM
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix
<snip>
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
I wonder if my last statement is essentially the only answer... which I wanted to avoid...
Should I just use combinations of the columns, construct the corrcoef() (then ID whether NaNs are present), or use the condition number to ID the singularity? I just wanted to avoid the whole k! algorithm.
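For reference, the brute-force subset search being avoided can be written down; the cost grows combinatorially with the number of variables, which is exactly the k! blow-up mentioned above (the function name `minimal_dependent_sets` is made up):

```python
from itertools import combinations

import numpy as np

def minimal_dependent_sets(rows, tol=1e-10):
    """Smallest row subsets that are linearly dependent (brute force)."""
    m = len(rows)
    found = []
    for size in range(2, m + 1):
        for idx in combinations(range(m), size):
            if any(set(f) <= set(idx) for f in found):
                continue  # already explained by a smaller dependent set
            if np.linalg.matrix_rank(rows[list(idx)], tol=tol) < size:
                found.append(idx)
    return found

zt = np.array([[1.  , 1. , 1. , 1.  , 1. ],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.  , 8. , 0. , 5.  , 0. ]])
print(minimal_dependent_sets(zt))  # [(0, 1, 2)]
```

Unlike the rank or condition number alone, this does return the minimal culprit set, at exponential cost.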
MJ
<snip>
On Fri, Aug 26, 2011 at 7:41 PM, Mark Janikas wrote:
<snip>
This is a completely naive, off-the-top-of-my-head reply, so it is most likely wrong. But wouldn't a Gram-Schmidt type process let you identify things here? You're effectively looking for n vectors that belong to an m-dimensional subspace with n > m. As you walk through the G-S process, you could track the projections and identify when one of the vectors in the n - m set is 'emptied out' by the G-S projections, and you would have the info of what it projected into.

I don't remember the details of G-S, so perhaps there's a really obvious reason why the above is dumb and doesn't work. But just in case it gets you thinking in the right direction... (and I'll learn something from the corrections)

Cheers,

f
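A rough sketch of that idea, using modified Gram-Schmidt and flagging any vector whose residual collapses to ~0 (the function name and tolerance are made up):

```python
import numpy as np

def gs_flag_dependent(vectors, rtol=1e-10):
    """Walk vectors in order; flag any that the current basis 'empties out'."""
    basis, dependent = [], []
    for i, v in enumerate(vectors):
        r = np.asarray(v, dtype=float).copy()
        for b in basis:          # subtract projections onto the kept basis
            r -= np.dot(r, b) * b
        norm = np.linalg.norm(r)
        if norm < rtol * np.linalg.norm(v):
            dependent.append(i)  # fully projected away: linearly dependent
        else:
            basis.append(r / norm)
    return dependent

zt = np.array([[1.  , 1. , 1. , 1.  , 1. ],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.  , 8. , 0. , 5.  , 0. ]])
print(gs_flag_dependent(zt))  # [2]
```

Note that which vector gets flagged depends on the processing order: here row 2 is flagged because rows 0 and 1 were kept first, even though the question framed row 0 as the dependent one.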
On Fri, Aug 26, 2011 at 11:41 AM, Mark Janikas wrote:
<snip>
Why not svd?

In [13]: u, d, v = svd(zt)

In [14]: d
Out[14]: array([  1.01307066e+01,   1.87795095e+00,   3.03454566e-01,
         3.29253945e-16])

In [15]: u[:,3]
Out[15]: array([ 0.57735027, -0.57735027, -0.57735027,  0.        ])

In [16]: dot(u[:,3], zt)
Out[16]: array([ -7.77156117e-16,  -6.66133815e-16,  -7.21644966e-16,
        -7.77156117e-16,  -8.88178420e-16])

Chuck
Charles! That looks like it could be a winner! It looks like you always choose the last column of the U matrix and ID the rows that have the same values? It works when I add extra columns as well! BTW, sorry for my lack of knowledge... but what was the point of the dot product at the end? That it is essentially zero, indicating singularity? Thanks so much!
MJ
From: Charles R Harris
Sent: Friday, August 26, 2011 11:04 AM
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix
<snip>
On Fri, Aug 26, 2011 at 12:38 PM, Mark Janikas wrote:
<snip>
The indicator of collinearity is the small singular value in d; the corresponding column in u represents the linear combination of rows that is ~0, and the corresponding row in v represents the linear combination of columns that is ~0. If you have several combinations that are ~0, of course you can add them together and get another. Basically, if you take the rows in v corresponding to small singular values, you get a basis for the null space of the matrix; the corresponding columns in u are a basis for the orthogonal complement of the range of the matrix. If that is getting a bit technical, you can just play around with things.

<snip>

Chuck
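That recipe can be packaged up as a small sketch: take the columns of u paired with ~0 singular values; their nonzero entries name the rows involved in each dependency (the tolerance choice here is an assumption, not from the thread):

```python
import numpy as np

zt = np.array([[1.  , 1. , 1. , 1.  , 1. ],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.  , 8. , 0. , 5.  , 0. ]])

u, d, v = np.linalg.svd(zt)
tol = d[0] * max(zt.shape) * np.finfo(float).eps  # relative cutoff for "zero"

# Each column w of u with a ~0 singular value satisfies w @ zt ~ 0;
# the nonzero entries of w are the rows in that dependency.
for w in u[:, d < tol].T:
    rows = np.nonzero(np.abs(w) > 1e-8)[0]
    print("dependent rows:", rows, "weights:", w[rows])
# -> dependent rows: [0 1 2] (row 0 = row 1 + row 2, up to the sign
#    convention of the singular vectors)
```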
On Fri, Aug 26, 2011 at 2:57 PM, Charles R Harris wrote:
<snip>
Interpretation is a bit difficult if there is more than one zero singular value:
>>> zt2 = np.vstack((zt, zt[2,:] + zt[3,:]))
>>> zt2
array([[ 1.  ,  1.  ,  1.  ,  1.  ,  1.  ],
       [ 0.25,  0.1 ,  0.2 ,  0.25,  0.5 ],
       [ 0.75,  0.9 ,  0.8 ,  0.75,  0.5 ],
       [ 3.  ,  8.  ,  0.  ,  5.  ,  0.  ],
       [ 3.75,  8.9 ,  0.8 ,  5.75,  0.5 ]])
>>> u, d, v = np.linalg.svd(zt2)
>>> d
array([  1.51561431e+01,   1.91327688e+00,   3.25113875e-01,
         1.05664844e-15,   5.29054218e-16])
>>> u[:,-2:]
array([[ 0.59948553, -0.12496837],
       [-0.59948553,  0.12496837],
       [-0.51747833, -0.48188813],
       [ 0.0820072 , -0.60685651],
       [-0.0820072 ,  0.60685651]])
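A sketch of why attribution becomes ambiguous here: with two exact dependencies the null space is two-dimensional, and the SVD returns an arbitrary orthonormal basis for it, so the individual columns of u no longer map one-to-one onto "culprit" sets; only their span is determined.

```python
import numpy as np

zt = np.array([[1.  , 1. , 1. , 1.  , 1. ],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.  , 8. , 0. , 5.  , 0. ]])
zt2 = np.vstack((zt, zt[2, :] + zt[3, :]))  # add a second dependent row

u, d, v = np.linalg.svd(zt2)
tol = d[0] * max(zt2.shape) * np.finfo(float).eps
n_null = int(np.sum(d < tol))
print(n_null)  # 2: any rotation of these two columns of u is equally
               # valid, so their weight patterns mix both dependencies.
```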
Josef
On Fri, Aug 26, 2011 at 1:41 PM, Mark Janikas wrote:
<snip>
Partial answer, from a different context: I have written a function that only adds columns if they maintain invertibility, using brute force: add each column sequentially and check whether the matrix is singular. Don't add the columns that are already included as a linear combination. (But this doesn't tell you which columns are in the collinear vector.) I did this for categorical variables, so the sequence was predefined.

Just finding a non-singular subspace would be easier: PCA, SVD, or scikits.learn matrix decomposition (?). (Factor models and Johansen's cointegration tests are also just doing matrix decompositions that identify subspaces.) Maybe rotation in factor analysis is able to identify the vectors, but I don't have much idea about that.

Josef
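That sequential screen might look like the following sketch (the function name `independent_rows` is made up); it greedily keeps the first maximal independent set it encounters, so which rows survive depends on the ordering:

```python
import numpy as np

def independent_rows(rows, tol=1e-10):
    """Greedily keep rows that increase the rank; skip linear combinations."""
    kept = []
    rank = 0
    for i, r in enumerate(rows):
        trial = np.vstack([rows[j] for j in kept] + [r])
        new_rank = np.linalg.matrix_rank(trial, tol=tol)
        if new_rank > rank:   # r adds a new direction: keep it
            kept.append(i)
            rank = new_rank
    return kept

zt = np.array([[1.  , 1. , 1. , 1.  , 1. ],
               [0.25, 0.1, 0.2, 0.25, 0.5],
               [0.75, 0.9, 0.8, 0.75, 0.5],
               [3.  , 8. , 0. , 5.  , 0. ]])
print(independent_rows(zt))  # [0, 1, 3]: row 2 is dropped as redundant
```

As noted above, this yields a non-singular subset but does not name the dependency itself: row 2 is dropped without saying that it equals row 0 minus row 1.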
participants (5)

- Charles R Harris
- Fernando Perez
- josef.pktd@gmail.com
- Mark Janikas
- Skipper Seabold