getting a submatrix of all true

Bengt Richter bokr at oz.net
Mon Jul 7 20:48:21 EDT 2003


On Mon, 07 Jul 2003 11:00:11 -0500, John Hunter <jdhunter at ace.bsd.uchicago.edu> wrote:

>>>>>> "Jon" == Jon  <wright at esrf.fr> writes:
>
>    Jon> Sadly I can only give you a method which is certainly not
>    Jon> optimal, but at least finds something which is probably not
>    Jon> too bad, and fairly quickly. It uses the Numeric module to speed
>    Jon> things up, so that a few thousand by a few hundred is quickly
>    Jon> computable. I'm curious to know if it does much better or
>    Jon> worse than other algorithms.
>
>Hi Jon, thanks for the suggestion.  For my data set, your solution
>performed a little better size-wise than my poor man's approach, which
>I gave in the original post.  And it is certainly fast enough.  I
>liked your idea of zeroing out the elements when you made a decision
>to delete a row or column.  When I was thinking about solutions using
>Numeric, I kept running up against the problem of deleting a row or
>column (which entails a whole array copy so is expensive).
>
>If anyone wants to try their algorithm on "real world data" which,
>unlike the ones discussed here, have lots of correlation between
>missing rows and columns that might be exploited, I have uploaded the
>missing data matrix to a web server.  It can be loaded with:
>
>  from urllib import urlopen
>  url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
>  L = [ [ int(val) for val in line.split()] 
>        for line in urlopen(url).readlines()]
>  # L is 1072x83
>
The statistics on this data are interesting, referring to counts of 1's
(from a quick script of mine, not very tested):

[15:31] C:\pywk\clp\submx>python showmx.py -f missing.dat -counts
nr=1072, nc=83
totcc=6239, totrc=6239 => *100./(83*1072) => 7.01
count:freq -- cols with count ones occurred freq times
   0:16      1:4       8:3      14:2      16:4      20:4      21:3      23:1
  25:6      27:1      28:7      32:4      41:1      59:1      60:3      62:3
  65:1      95:2      96:1     108:3     125:1     134:1     158:1     177:1
 199:1     206:2     237:1     241:1     252:1     254:1    1061:2
count:freq -- rows with count ones occurred freq times
   0:3       1:2       2:460     3:67      4:140     5:66      6:42      7:49
   8:61      9:17     10:46     11:23     12:8      13:8      14:12     15:10
  16:3      17:5      18:1      19:7      20:4      21:2      22:5      23:1
  27:2      28:2      30:2      31:1      33:1      34:1      36:2      38:1
  40:1      41:1      44:4      45:7      48:1      55:2      56:2
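
(Not posting showmx.py itself, but for reference here is roughly how the
tallies could be computed from the L loaded above -- a sketch, not the
actual script:)

    def count_freq(vectors):
        # map (number of ones in a vector) -> (how many vectors had that count)
        freq = {}
        for v in vectors:
            c = sum(v)
            freq[c] = freq.get(c, 0) + 1
        return freq

    cols = [[row[j] for row in L] for j in range(len(L[0]))]   # transpose
    colfreq = count_freq(cols)     # e.g. {0:16, 1:4, 8:3, ...} per the run above
    rowfreq = count_freq(L)        # e.g. {0:3, 1:2, 2:460, ...}
    total = sum(sum(row) for row in L)              # 6239 ones
    print(100.0 * total / (len(L) * len(L[0])))     # ~7.01 percent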

And sorted by descending frequency of the number of cols and rows with particular counts:

[15:31] C:\pywk\clp\submx>python showmx.py -f missing.dat -counts -freq
nr=1072, nc=83
totcc=6239, totrc=6239 => *100./(83*1072) => 7.01
freq:count -- cols with count ones occurred freq times
  16:0       7:28      6:25      4:32      4:20      4:16      4:1       3:108
   3:62      3:60      3:21      3:8       2:1061    2:206     2:95      2:14
   1:254     1:252     1:241     1:237     1:199     1:177     1:158     1:134
   1:125     1:96      1:65      1:59      1:41      1:27      1:23
freq:count -- rows with count ones occurred freq times
 460:2     140:4      67:3      66:5      61:8      49:7      46:10     42:6
  23:11     17:9      12:14     10:15      8:13      8:12      7:45      7:19
   5:22      5:17      4:44      4:20      3:16      3:0       2:56      2:55
   2:36      2:30      2:28      2:27      2:21      2:1       1:48      1:41
   1:40      1:38      1:34      1:33      1:31      1:23      1:18

There are two columns with 1061 ones out of a possible 1072, so they account for 2
in most of the row counts. Hence the most popular row count (2) mostly goes to 0.
Interestingly, a row count of 4 (i.e., 2 net of the stuck columns) is way more
popular than 3 (i.e., 1). This doesn't look like what I'd expect from a uniform
overall probability at some percentage. Discounting the stuck columns, zero ones
shouldn't be more likely than exactly one one, since there's only one way to get
all zeroes, but there are then 81 equally probable ways to get one one.
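
(A quick check of that, assuming independent cells at p = 0.07 across the 81
free columns -- just the first two binomial terms:)

    n, p = 81, 0.07
    p0 = (1 - p)**n                  # P(a row has zero ones)
    p1 = n * p * (1 - p)**(n - 1)    # P(a row has exactly one one)
    print(p0, p1, p1 / p0)           # p1/p0 = n*p/(1-p), about 6.1

So exactly one one should be about six times as likely as none, yet the data
above show 460 rows at (net) zero ones vs. 67 at one.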

Look at a uniform 5% on 1072 x 83 (run twice to show the variation).
There's a distribution with tails, centered on 5% of 1072 for columns and
5% of 83 for rows, it seems:

[17:48] C:\pywk\clp\submx>showmx.py 1072 83 5      -counts
nr=1072, nc=83
totcc=4338, totrc=4338 => *100./(83*1072) => 4.88
count:freq -- cols with count ones occurred freq times
  35:1      39:1      40:2      41:1      42:5      43:1      45:5      46:3
  47:2      48:6      49:3      50:5      51:5      52:5      53:5      54:5
  55:5      56:5      58:3      59:4      61:1      62:2      63:1      64:1
  65:3      70:1      78:2
count:freq -- rows with count ones occurred freq times
   0:21      1:74      2:134     3:219     4:218     5:176     6:113     7:64
   8:29      9:15     10:6      11:2      12:1

[17:48] C:\pywk\clp\submx>showmx.py 1072 83 5      -counts
nr=1072, nc=83
totcc=4394, totrc=4394 => *100./(83*1072) => 4.94
count:freq -- cols with count ones occurred freq times
  37:1      40:1      41:3      43:1      44:2      45:2      46:5      47:5
  48:1      49:3      50:6      51:8      52:7      53:7      54:3      55:3
  56:4      57:3      58:3      59:1      60:1      61:1      62:1      63:1
  64:3      65:2      68:2      69:1      70:1      72:1
count:freq -- rows with count ones occurred freq times
   0:19      1:58      2:150     3:218     4:218     5:173     6:111     7:63
   8:38      9:13     10:5      11:5      12:1
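
(The uniform test matrix is just i.i.d. cells; a sketch of generating one --
not showmx.py itself:)

    import random

    def random_mx(nr, nc, pct):
        # each cell is independently 1 with probability pct/100
        return [[int(random.random() < pct / 100.0) for j in range(nc)]
                for i in range(nr)]

    M = random_mx(1072, 83, 5)
    print(sum(sum(row) for row in M))   # expect ~4449 ones (5% of 83*1072)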


>I think I may adapt your algorithm to terminate when the percent
>missing falls below some threshold, removing the requirement that
>*all* missing values be removed.  This will drop the worst offenders,
>observation- or variable-wise.  Then I can use a regression approach to
>fill in the remaining ones.
>
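FWIW, here is a minimal untested sketch of that thresholded variant, greedily
dropping the worst remaining row or column as in Jon's approach -- names and
details made up, plain lists rather than Numeric:

    def prune(L, maxfrac=0.01):
        # greedily drop whichever remaining row or column has the most
        # ones, until the fraction of ones left is <= maxfrac
        keep_r = list(range(len(L)))
        keep_c = list(range(len(L[0])))
        while keep_r and keep_c:
            rcounts = [(sum(L[i][j] for j in keep_c), i) for i in keep_r]
            ccounts = [(sum(L[i][j] for i in keep_r), j) for j in keep_c]
            total = sum(n for n, i in rcounts)
            if total <= maxfrac * len(keep_r) * len(keep_c):
                break
            (rmax, i), (cmax, j) = max(rcounts), max(ccounts)
            if rmax >= cmax:
                keep_r.remove(i)
            else:
                keep_c.remove(j)
        return keep_r, keep_c

E.g., rows, cols = prune(L, .01) would stop once the ones are down to 1% of
the remaining submatrix.
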
>    Jon> Interesting problem!
Yes. So many, so little time ... ;-/

>
>I agree.  There are lots of extensions to the original formulation
>that are relevant to the missing value problem.  As we have seen in
>the posts here, there are good exact solutions for smallish matrices,
>but for larger ones a statistical approach is required.  The right
>statistical solution surely depends on the correlation structure of
>the matrix.
I'm pretty sure there are better exact algorithms than the brute-force one
I first posted, but I haven't written a new one since, just some little
scripts to show things about the data, as above.
>
>Also, it would be nice to generalize to higher dimensions (eg 3D
>matrices).  For my problem, I have N observations of M variables over
>D days.  So this is the 3D version of the 2D problem above.  Your
>solution is readily adaptable to this case.  More generally, you might
>want to impose extra requirements, like: no days can be dropped; at most
>10% of the days can be dropped; or, if you want to compute changes in
>variables over days, again that no days can be dropped.
>
I'm wondering if you might not have other considerations about which
data points to drop, too. E.g., if an observation makes the difference
between being able to invert a matrix or not, you might opt to drop
several other things instead, even though in terms of count the set
would be less optimal. (Curiosity: what do your observations represent?)

>The more general formulation is something like
>
>  Given an N-D (N0, M0, P0, ...) boolean matrix X with a covariance
>  matrix C, what is the largest submatrix of X of dimensions (N1, M1,
>  P1, ...) such that the fraction of remaining dimensions satisfies
>
>    (N1/N0, M1/M0, P1/P0, ...) >= (alpha, beta, gamma, ...)
>
>  and the total percent true <= p.
>
>The original problem is a special case of this where alpha=0, beta=0,
>p=0 and C unspecified.
>
>Anyone looking for a dissertation project :-) ?
>
>John Hunter
>

Regards,
Bengt Richter
