getting a submatrix of all true
Bengt Richter
bokr at oz.net
Thu Jul 3 06:25:40 CEST 2003
On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <jdhunter at ace.bsd.uchicago.edu> wrote:
>
>I have a largish data set (1000 observations x 100 floating point
>variables), and some of the of the data are missing. I want to try a
>variety of clustering, neural network, etc, algorithms on the data,
>and to keep life simple I want to reduce the dimensions of the matrix
>so that I have no missing values, since not all the algorithms are
>able to handle them and there is sufficient redundancy in the
>variables that I can afford to lose some.
>
>I am currently using a hack that works, but it makes me wonder if
>there is an optimal solution. I define optimal as the removal of rows
>and columns such that there are no missing values and
>max(numRows*numCols).
>
>My current approach is to drop rows (observations) that have more than
>some prespecified number of missing variables, and then drop the
>columns (variables) of the reduced data set that have any missing
>values. I chose the threshold for dropping a row by eyeballing the
>distribution of number of missing variables per observation, pick a
>number on the low end of the distribution, and dropping the rows that
>exceed the threshold.
>
>Another way of formulating the question: for a sparse boolean matrix
>(sparse on True), what is the optimal way to remove rows and columns
>so that the total number of elements in the matrix is maximal and
>there are no True values left.
>
>
>Example:
>
> 0 0 0
> 0 0 0 candidate sub matrix has 12 elements
> 0 0 0
> 0 0 0
>
>1 0 0 0 1
>0 0 0 0 0 0 0 0 0 0
>0 0 0 0 0 0 0 0 0 0 candidate submatrix has 15 elements
>0 0 0 0 0 0 0 0 0 0
>0 0 1 0 0
>
> 0 0
> 0 0 candidate submatrix has 8 elements
> 0 0
> 0 0
>
>I want to programatically extract the 15 element matrix
If I understand your original optimality definition, that would be suboptimal.
I.e., how about the 16-element matrix? (x's mark the corresponding zeroes)
1 0 0 0 1
x x 0 x x
x x 0 x x
x x 0 x x
x x 1 x x
Or do they have to be adjacent?
>
>Following the approach described above, I get the desired answer in
>the example below, though this is a hack solution and I have the
>feeling there is a better one.
>
> from Numeric import nonzero, array, take, sum
>
> X = array([[1, 0, 0, 0, 1],
> [0, 0, 0, 0, 0],
> [0, 0, 0, 0, 0],
> [0, 0, 0, 0, 0],
> [0, 0, 1, 0, 0]])
>
> goodObsInd = nonzero(sum(X,1)<2) # observations with < 2 missing variables
> X = take(X, goodObsInd) # drop the bad
>
> goodVarInd = nonzero(sum(X)==0) # variables with no missing data
> X = take(X, goodVarInd, 1 ) # drop the bad variables
>
> print X
>
Brute force seems to work on this little example. Maybe it can
be memoized and optimized and/or whatever to handle your larger matrix fast enough?
====< submatrix.py >================================
X = [
[1, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]
]
def optimrcs(mx, rows=[], cols=[]):
if not rows or not cols: return 0,[],[]
maxscore = 0
for r in rows:
if not mx[r].count(1): continue
for c in cols:
if not mx[r][c]: continue
# eval column submatrix vs row submatrix
colsxthis = cols[:]; colsxthis.remove(c)
colscore, rset, cset = optimrcs(mx, rows, colsxthis)
if colscore>maxscore:
maxscore, maxrows, maxcols = colscore, rset, cset
rowsxthis = rows[:]; rowsxthis.remove(r)
rowscore, rset, cset = optimrcs(mx, rowsxthis, cols)
if rowscore >= maxscore:
maxscore, maxrows, maxcols = rowscore, rset, cset
if not maxscore:
return len(rows)*len(cols), rows, cols
return maxscore, maxrows,maxcols
if __name__ == '__main__':
score, rowsel, colsel = optimrcs(X, range(5), range(5))
print 'Score = %s, rows = %s, cols = %s' % (score,rowsel,colsel)
print
for r in range(5):
row = X[r]
for c in range(5):
if r in rowsel and c in colsel:
print '<%s>'% row[c],
else:
print ' %s '% row[c],
print
====================================================
Running it results thus:
[21:24] C:\pywk\clp>submatrix.py
Score = 16, rows = [1, 2, 3, 4], cols = [0, 1, 3, 4]
1 0 0 0 1
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 1 <0> <0>
Regards,
Bengt Richter
More information about the Python-list
mailing list