[Tutor] Algorithm for combination analysis suggestion.

Gerard Flanagan grflanagan at gmail.com
Fri Feb 12 19:35:10 CET 2010


Matthew Matson wrote:
> 
> Hi Tutors,
> 
> I am looking for the proper approach regarding the analysis of a 
> dictionary of combinations I have.
> 
> What I need to do is read from a supplied text file that has a unique ID 
> and that unique ID's associated combination of elements. So let's say I 
> have the following lines in a text file (real file could be millions of 
> lines):
> 
> "ID"    "Elements"
> 1    'A, B, C, D'
> 2    'A, D'
> 3    'D, E'
> 4    'A, D'
> 5    'A, B'
> 6    'A, C, D'
> 
> and I do something like...
> 
> combinationDict = {}
> for line in file:
>     data = line.split('\t')
>     comb = tuple(data[1].split(','))
>     if comb not in combinationDict:
>         combinationDict[comb] = 1
>     else:
>         combination[comb] +=1
> 
> Now after I read all of the data I end up with a dictionary with the 
> combination as the key and the associated total qty as its value.
> 
> print combinationDict
> {('A','B','C','D'):1, ('A','D'):2, ('D','E'):1, ('A','B'):1, ('A', 'C', 
> 'D'):1}
> 
> What I am looking for is a starting point for a solution in python to 
> analyze the combination list so that I can determine for example that 
> ('A', 'D') is the most popular combination and then determining how many 
> other combinations in the dictionary contain this combination.
> 
> I would like to incorporate some parameters so for example the 
> combination ('A','B','C','D') and ('A', 'C', 'D') contain ('A','D') so 
> they are valid but I could also say that as long as one element is 
> contained in a combination it is valid as well provided I add no more 
> than one additional item to the combination. If I apply this logic then 
> ('D','E') can ('A','B') can contain ('A', 'D') and if I apply this to 
> the combination dictionary I have:
> 
> {('B','C', ('A', 'D')):1, ('A','D'):2, ('E', ('A', 'D')):1, ('B', ('A', 
> 'D')):1, ('C', ('A', 'D')):1}
> 
> which I could then query the keys for ('A', 'D') inclusion to get a 
> total of 4 for ('A', 'D').
> 
> I hope this isn't too long and confusing but I am looking for an 
> approach where I can analyze for the highest quantity of combinations 
> and then iterate through the dictionary substituting those combinations 
> that were determined a "highest qty" combination into other low qty 
> combinations when valid.
> 
> I was hoping to have parameters to qualify a high qty combination (e.g. 
> every combination with qty above 10,000) with the highest quantity of 
> that determined set taking precedence for substitution for the first 
> pass then moving on to the next highest combination for the second pass 
> of substitution etc.. The other parameter would be for the combination 
> that would receive a substitution whereby I might say that I can only 
> substitute if a substitution results in only one additional 
> (superfluous) value being added to the combination existing low qty 
> combination.
> 
> I have looked around and this sounds like it might be similar to a 
> packing problem and in particular the knapsack problem but I can't seem 
> to wrap my head around an approach for this in python. I am not looking 
> for a solution just some guidance on a starting point or perhaps 
> libraries that may be helpful.
> 
> Thank you.
> 
> 

Hey, never heard of commas! I'm out of breath after that ;-)

It sounds something like Linear Programming - not that I know much about 
it, but maybe asking on other lists, eg. scipy, might turn up some kind 
of standard approach to this kind of problem.

One idea for the 'is a combination contained in another combination' 
issue is using bitmasks. Create a bitmask for each combination, then C1 
is a component of C2 if C1 & C2 == C1. (Hope i've got that right!) Some 
code doodling below shows the idea.


Regards

G.F.

-----------------------------------------------------

from collections import defaultdict

data = """
1    A, B, C, D
2    A, D
3    D, E
4    A, D
5    A, B
6    A, C, D
7    E
""".splitlines()

atoms = {}
key2mask = {}
mask2key = {}
combos = defaultdict(int)

for line in data:
     if line:
         key = ''
         mask = 0
         for atom in line[5:].split(','):
             atom = atom.strip()
             key += atom
             mask |= atoms.setdefault(atom, 1 << len(atoms))
         key2mask[key] = mask
         mask2key[mask] = key
         combos[mask] += 1

for combo in ['AB', 'AD', 'DE', 'E']:
     mask = key2mask[combo]
     print
     for c in combos.iterkeys():
         if mask & c == mask:
             print '%s is a component of %s' % (combo, mask2key[c])
     print 'there are %s %ss' % (combos[mask], combo)

------------------------------------------------------

def xuniqueCombinations(items, n):
     if n==0:
         yield []
     else:
         for i in xrange(len(items)):
             for cc in xuniqueCombinations(items[i+1:],n-1):
                 yield [items[i]]+cc

def allcombos():
     for k in range(1,6):
         for c in xuniqueCombinations('ABCDE', k):
             yield c

-------------------------------------------------------




More information about the Tutor mailing list