[SciPy-user] Combinatoric/Statistics: length of longest sequence

Dr. Hans Georg Krauthaeuser hgk at et.uni-magdeburg.de
Mon Oct 30 03:03:04 EST 2006


A. M. Archibald wrote:
> On 29/10/06, Dr. Hans Georg Krauthaeuser <hgk at et.uni-magdeburg.de> wrote:
>> Dear All,
>>
>> I have a 2d-array A of p-values (coming from pairwise statistical tests
>> of N datasets). N is typically in the range of 100-1000. Now, given a
>> certain threshold, I want to find the longest subset S of range(N), so
>> that A[i][j]>threshold for all i,j from S.
> 
> It's not quite clear, here, whether you want the longest *contiguous*
> subset ({2,3,4,5,6}) or the largest arbitrary subset ({2,17,21,22}).

Oh well, sorry for being not precise enough. I'm searching for the
largest arbitrary subset.

> 
>> Unfortunately, the array A is not 'sparse', i.e. a lot of pairs (i,j)
>> will fulfill A[i][j]>threshold (typically ~50% of the elements).
>>
>> All I can think of are brute force algorithms that probably will run
>> forever because of the large number of possible combinations to check.
> 
> If you want the longest contiguous subset, for N=1000 that's only of
> order a million operations for a brute-force algorithm (for each
> starting element, see how long the run is), so unless you want to do
> it lots of times I wouldn't worry. If you do, some Numeric cleverness
> might allow you to move some of the heavy lifting to C, though for
> simple indexing calculations pure python can be plenty fast.
> 
> If you want the largest arbitrary subset, that's going to be
> expensive, and it bears an unnerving resemblance to the subset-sum
> problem. I can imagine an expensive  recursive algorithm that would do
> it without too much pain, though the space to search is so huge you're
> liable to be taking a very long time. However you go about it, you may
> be better off using python's set datatype (for element i, form the set
> of all elements j for which A[i][j]>threshold).
> 
> I've attached a demonstration program that solves both interpretations
> of your problem; the difficult one appears to bog down seriously
> around N=200 (with p=0.5) but the easy one runs into memory problems
> before taking too long.
> 
> A. M. Archibald
> 
> 

Thanks a lot for your hints and for the example code. I will check what
I can do with it. I'll be back when I have some results.


Regards
Hans Georg




More information about the SciPy-User mailing list