getting a submatrix of all true

John Hunter jdhunter at ace.bsd.uchicago.edu
Mon Jul 7 12:00:11 EDT 2003


>>>>> "Jon" == Jon  <wright at esrf.fr> writes:

    Jon> Sadly I can only give you a method which is certainly not
    Jon> optimal, but at least finds something which is probably not
    Jon> too bad, and fairly quickly. Uses the Numeric module to speed
    Jon> up, so that a few thousand by a few hundred is quickly
    Jon> computable. I'm curious to know if it does much better or
    Jon> worse than other algorithms.

Hi Jon, thanks for the suggestion.  For my data set, your solution
performed a little better size-wise than my poor man's approach, which
I gave in the original post.  And it is certainly fast enough.  I
liked your idea of zeroing out the elements once you decide to delete
a row or column.  When I was thinking about solutions using Numeric, I
kept running up against the problem of actually deleting a row or
column (which entails a whole-array copy and so is expensive).
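
To make the idea concrete, here is a rough sketch of the greedy
zero-out scheme (not Jon's actual code).  It assumes a 0/1 matrix with
1 marking a missing value and is written against numpy-style calls for
illustration; the max_frac argument is my own addition, anticipating
the threshold idea below, and max_frac=0 requires that every missing
value be removed.

  import numpy as np

  def largest_clean_submatrix(X, max_frac=0.0):
      # X: 0/1 array where 1 marks a missing value.  Greedily zero out
      # (rather than delete) the row or column holding the most misses
      # until the miss fraction among surviving cells is <= max_frac.
      work = np.array(X, dtype=int)
      keep_rows = np.ones(work.shape[0], dtype=bool)
      keep_cols = np.ones(work.shape[1], dtype=bool)
      while True:
          n_cells = keep_rows.sum() * keep_cols.sum()
          if n_cells == 0 or work.sum() <= max_frac * n_cells:
              break
          row_miss = work.sum(axis=1)   # misses per row (dropped rows are all zero)
          col_miss = work.sum(axis=0)   # misses per column
          r, c = row_miss.argmax(), col_miss.argmax()
          if row_miss[r] >= col_miss[c]:
              work[r, :] = 0            # "drop" the row by zeroing it
              keep_rows[r] = False
          else:
              work[:, c] = 0            # "drop" the column by zeroing it
              keep_cols[c] = False
      return np.nonzero(keep_rows)[0], np.nonzero(keep_cols)[0]

Zeroing a row or column in place is cheap compared with reallocating
and copying the whole array on every deletion, which is what kept
tripping me up.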

If anyone wants to try their algorithm on "real world data" which,
unlike the ones discussed here, have lots of correlation between
missing rows and columns that might be exploited, I have uploaded the
missing data matrix to a web server.  It can be loaded with:

  from urllib import urlopen
  url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
  L = [ [ int(val) for val in line.split()] 
        for line in urlopen(url).readlines()]
  # L is 1072x83

I think I may adapt your algorithm to terminate when the percent
missing falls below some threshold, removing the requirement that
*all* missing values be removed.  This will drop the worst offenders,
observation- or variable-wise.  Then I can use a regression approach
to fill in the remaining ones.
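
With the hypothetical sketch above, that change is just a matter of
passing a nonzero threshold, and the survivors can then be handed off
to the regression step:

  # keep dropping rows/columns only until at most 5% of the surviving
  # cells are missing (5% is a placeholder, not a recommendation)
  rows, cols = largest_clean_submatrix(L, max_frac=0.05)
  sub = np.array(L)[np.ix_(rows, cols)]   # 0/1 missingness pattern of the kept block

Here L is the matrix loaded above; the right threshold depends on how
much the regression fill-in can be trusted.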

    Jon> Interesting problem!

I agree.  There are lots of extensions to the original formulation
that are relevant to the missing value problem.  As we have seen in
the posts here, there are good exact solutions for smallish matrices,
but for larger ones a statistical approach is required.  The right
statistical solution surely depends on the correlation structure of
the matrix.

Also, it would be nice to generalize to higher dimensions (e.g. 3D
arrays).  For my problem, I have N observations of M variables over
D days, so this is the 3D version of the 2D problem above.  Your
solution is readily adaptable to this case.  More generally, you
might want to impose extra requirements: for example, that at most
10% of the days can be dropped, or, if you want to compute changes in
variables over days, that no days can be dropped at all.
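
Again purely as a sketch of how the greedy version might carry over
(the frozen_axes argument is my invention for the "no days can be
dropped" constraint, not anything from Jon's post):

  import numpy as np

  def largest_clean_subarray(X, frozen_axes=()):
      # N-D version of the greedy sketch: X is a 0/1 array with 1
      # marking a miss.  Repeatedly zero out the slab (observation,
      # variable, day, ...) holding the most misses, never dropping
      # slabs along any axis listed in frozen_axes (e.g. frozen_axes=(2,)
      # to keep every day).
      work = np.array(X, dtype=int)
      keep = [np.ones(n, dtype=bool) for n in work.shape]
      while work.sum() > 0:
          best = None                   # (miss count, axis, slab index)
          for ax in range(work.ndim):
              if ax in frozen_axes:
                  continue
              other = tuple(a for a in range(work.ndim) if a != ax)
              miss = work.sum(axis=other)   # misses per slab along this axis
              i = int(miss.argmax())
              if best is None or miss[i] > best[0]:
                  best = (int(miss[i]), ax, i)
          if best is None:              # every axis frozen; nothing left to drop
              break
          _, ax, i = best
          cut = [slice(None)] * work.ndim
          cut[ax] = i
          work[tuple(cut)] = 0          # zero the slab instead of copying
          keep[ax][i] = False
      return [np.nonzero(k)[0] for k in keep]

With frozen_axes=() and a 2D input this reduces to the greedy sketch
above.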

The more general formulation is something like:

  Given an N-D boolean matrix X of shape (N0, M0, P0, ...) with a
  covariance matrix C, what is the largest submatrix of X, of shape
  (N1, M1, P1, ...), such that the fraction retained along each
  dimension satisfies

    (N1/N0, M1/M0, P1/P0, ...) >= (alpha, beta, gamma, ...)

  and the overall fraction of true (missing) values is <= p?

The original problem is a special case of this where alpha=0, beta=0,
p=0 and C unspecified.

Anyone looking for a dissertation project :-) ?

John Hunter
