getting a submatrix of all true
John Hunter
jdhunter at ace.bsd.uchicago.edu
Mon Jul 7 18:47:31 EDT 2003
>>>>> "John" == John Hunter <jdhunter at ace.bsd.uchicago.edu> writes:
John> I think I may adapt your algorithm to terminate when the
John> percent missing falls below some threshold, removing the
John> requirement that *all* missing values be removed. This will
John> drop the worst offenders observation or variable wise. Then
John> I can use a regression approach to fill in the remaining
John> ones.
I think I have found a major improvement on your algorithm Jon, by
changing the rule which determines whether to drop a row or a column
after you have found the worst offenders for each. The original
approach was to keep the one with the most valid (zero) elements,
which makes sense since that is the quantity you are trying to
maximize. But consider the case where there is correlation so that
true values on a row or column tend to be correlated with other true
values on that row or column, as in
X = 1 0
1 0
1 0
1 0
1 0
0 0
0 0
Your algorithm would keep the column, since keeping a row gives you
only one zero whereas keeping a column gives you 2. By looking at the
*fraction* of ones in the worst offending row and column, you can do
much better. By deleting the row or column with the greatest fraction
of ones, you reduce the likelihood that you'll encounter a one in
future iterations. Using this approach, I was able to find a 893x67
submatrix with a score of 59831 versus the 462x81 submatrix with a
score of 37422.
I have also added thresholds to the code, so that you can terminate
with a fraction of missing values remaining. By setting the
thresholds to zero, you get the special case of no missing values.
The example below also includes a no-copy optimization that Jon
emailed me off list. The algorithm run time for a 1073x82 matrix is
around 30ms on my 700MHz machine.
John Hunter
from __future__ import division
import sys, time
from urllib import urlopen
from Numeric import *
def trim_missing(a, thresh=(0.05,0.05) ):
"""
a is a numObservations by numVariables boolean matrix
thresh = (othresh, vthresh) are the observation and variable
thresholds that trigger deletion
remove any rows where percent missing vars > vthresh
remove any cols where percent missing obs > othresh
This is done iteratively rechecking the percent missing on each
iteration or row or col removal, with the largest offender removed
on each iteration
return value is score, rowInd, colInd where the two ind arrays
are the indices of the rows and cols that were kept
"""
othresh, vthresh = thresh # theshold for deletion
dr=zeros(a.shape[0]) # Arrays to note deleted rows...
dc=zeros(a.shape[1]) # ... and columns
sizeleft=list(a.shape) # Remaining dimensions
sr=sum(a,1) # Column scores = ij's to remove
sc=sum(a,0) # Row scores
while 1: # Keep deleting till none left
mr=argmax(sr) # index highest row score
mc=argmax(sc) # index highest column score
fracRow = sr[mr]/sizeleft[1]
fracCol = sc[mc]/sizeleft[0]
if fracRow<=vthresh and fracCol<=othresh:
break # stop when missing criteria are satisfied
if fracRow-vthresh<fracCol-othresh:
sr=sr-a[:,mc] # rows reduced by contents of column
sc[mc]=0 # sum of column now zero
dc[mc]=1 # tags column as deleted
sizeleft[1]-=1
else:
# deleting the row
sc=sc-a[mr,:] # columns reduced by contents of row
sr[mr]=0 # sum of row is now zero
dr[mr]=1 # tags row as deleted
sizeleft[0]-=1
score=(a.shape[0]-sum(dr))*(a.shape[1]-sum(dc))
dr=compress(dr==0,range(a.shape[0])) # gives return format of
dc=compress(dc==0,range(a.shape[1])) # optimrcs posted by Bengt
return score, dr, dc
url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
L = [ [ int(val) for val in line.split()]
for line in urlopen(url).readlines()]
a= array(L)
start = time.time()
score, goodRows, goodCols = trim_missing(a, thresh=(0.0, 0.0))
end = time.time()
print end-start
print 'Original size', a.shape
print 'Number of non-zeros in', sum(ravel(a))
print 'Reduced size', len(goodRows), len(goodCols)
print 'Score', score
More information about the Python-list
mailing list