[Tutor] Algorithm for combination analysis suggestion.
Gerard Flanagan
grflanagan at gmail.com
Fri Feb 12 19:35:10 CET 2010
Matthew Matson wrote:
>
> Hi Tutors,
>
> I am looking for the proper approach regarding the analysis of a
> dictionary of combinations I have.
>
> What I need to do is read from a supplied text file that has a unique ID
> and that unique ID's associated combination of elements. So let's say I
> have the following lines in a text file (real file could be millions of
> lines):
>
> "ID" "Elements"
> 1 'A, B, C, D'
> 2 'A, D'
> 3 'D, E'
> 4 'A, D'
> 5 'A, B'
> 6 'A, C, D'
>
> and I do something like...
>
> combinationDict = {}
> for line in file:
> data = line.split('\t')
> comb = tuple(data[1].split(','))
> if comb not in combinationDict:
> combinationDict[comb] = 1
> else:
> combination[comb] +=1
>
> Now after I read all of the data I end up with a dictionary with the
> combination as the key and the associated total qty as its value.
>
> print combinationDict
> {('A','B','C','D'):1, ('A','D'):2, ('D','E'):1, ('A','B'):1, ('A', 'C',
> 'D'):1}
>
> What I am looking for is a starting point for a solution in python to
> analyze the combination list so that I can determine for example that
> ('A', 'D') is the most popular combination and then determining how many
> other combinations in the dictionary contain this combination.
>
> I would like to incorporate some parameters so for example the
> combination ('A','B','C','D') and ('A', 'C', 'D') contain ('A','D') so
> they are valid but I could also say that as long as one element is
> contained in a combination it is valid as well provided I add no more
> than one additional item to the combination. If I apply this logic then
> ('D','E') can ('A','B') can contain ('A', 'D') and if I apply this to
> the combination dictionary I have:
>
> {('B','C', ('A', 'D')):1, ('A','D'):2, ('E', ('A', 'D')):1, ('B', ('A',
> 'D')):1, ('C', ('A', 'D')):1}
>
> which I could then query the keys for ('A', 'D') inclusion to get a
> total of 4 for ('A', 'D').
>
> I hope this isn't too long and confusing but I am looking for an
> approach where I can analyze for the highest quantity of combinations
> and then iterate through the dictionary substituting those combinations
> that were determined a "highest qty" combination into other low qty
> combinations when valid.
>
> I was hoping to have parameters to qualify a high qty combination (e.g.
> every combination with qty above 10,000) with the highest quantity of
> that determined set taking precedence for substitution for the first
> pass then moving on to the next highest combination for the second pass
> of substitution etc.. The other parameter would be for the combination
> that would receive a substitution whereby I might say that I can only
> substitute if a substitution results in only one additional
> (superfluous) value being added to the combination existing low qty
> combination.
>
> I have looked around and this sounds like it might be similar to a
> packing problem and in particular the knapsack problem but I can't seem
> to wrap my head around an approach for this in python. I am not looking
> for a solution just some guidance on a starting point or perhaps
> libraries that may be helpful.
>
> Thank you.
>
>
Hey, never heard of commas! I'm out of breath after that ;-)
It sounds something like Linear Programming - not that I know much about
it, but maybe asking on other lists, eg. scipy, might turn up some kind
of standard approach to this kind of problem.
One idea for the 'is a combination contained in another combination'
issue is using bitmasks. Create a bitmask for each combination, then C1
is a component of C2 if C1 & C2 == C1. (Hope i've got that right!) Some
code doodling below shows the idea.
Regards
G.F.
-----------------------------------------------------
from collections import defaultdict
data = """
1 A, B, C, D
2 A, D
3 D, E
4 A, D
5 A, B
6 A, C, D
7 E
""".splitlines()
atoms = {}
key2mask = {}
mask2key = {}
combos = defaultdict(int)
for line in data:
if line:
key = ''
mask = 0
for atom in line[5:].split(','):
atom = atom.strip()
key += atom
mask |= atoms.setdefault(atom, 1 << len(atoms))
key2mask[key] = mask
mask2key[mask] = key
combos[mask] += 1
for combo in ['AB', 'AD', 'DE', 'E']:
mask = key2mask[combo]
print
for c in combos.iterkeys():
if mask & c == mask:
print '%s is a component of %s' % (combo, mask2key[c])
print 'there are %s %ss' % (combos[mask], combo)
------------------------------------------------------
def xuniqueCombinations(items, n):
if n==0:
yield []
else:
for i in xrange(len(items)):
for cc in xuniqueCombinations(items[i+1:],n-1):
yield [items[i]]+cc
def allcombos():
for k in range(1,6):
for c in xuniqueCombinations('ABCDE', k):
yield c
-------------------------------------------------------
More information about the Tutor
mailing list