[Tutor] Algorithm for combination analysis suggestion.

Thu Feb 11 19:59:54 CET 2010

Hi Tutors, 

I am looking for the proper approach regarding the analysis of a dictionary of combinations I have.

What I need to do is read from a supplied text file that has a unique ID and that unique ID's associated combination of elements. So let's say I have the following lines in a text file (real file could be millions of lines):

"ID"    "Elements"
1    'A, B, C, D'
2    'A, D'
3    'D, E'
4    'A, D'
5    'A, B'
6    'A, C, D'

and I do something like...

combinationDict = {}
for line in file:
    data = line.split('\t')
    comb = tuple(data[1].split(','))
    if comb not in combinationDict:
        combinationDict[comb] = 1
    else:
        combination[comb] +=1

Now after I read all of the data I end up with a dictionary with the combination as the key and the associated total qty as its value.

print combinationDict
{('A','B','C','D'):1, ('A','D'):2, ('D','E'):1, ('A','B'):1, ('A', 'C', 'D'):1}

What I am looking for is a starting point for a solution in python to analyze the combination list so that I can determine for example that ('A', 'D') is the most popular combination and then determining how many other combinations in the dictionary contain this combination. 

I would like to incorporate some parameters so for example the combination ('A','B','C','D') and ('A', 'C', 'D') contain ('A','D') so they are valid but I could also say that as long as one element is contained in a combination it is valid as well provided I add no more than one additional item to the combination. If I apply this logic then ('D','E') can ('A','B') can contain ('A', 'D') and if I apply this to the combination dictionary I have:

{('B','C', ('A', 'D')):1, ('A','D'):2, ('E', ('A', 'D')):1, ('B', ('A', 'D')):1, ('C', ('A', 'D')):1}

which I could then query the keys for ('A', 'D') inclusion to get a total of 4 for ('A', 'D').

I hope this isn't too long and confusing but I am looking for an approach where I can analyze for the highest quantity of combinations and then iterate through the dictionary substituting those combinations that were determined a "highest qty" combination into other low qty combinations when valid.

I was hoping to have parameters to qualify a high qty combination (e.g. every combination with qty above 10,000) with the highest quantity of that determined set taking precedence for substitution for the first pass then moving on to the next highest combination for the second pass of substitution etc.. The other parameter would be for the combination that would receive a substitution whereby I might say that I can only substitute if a substitution results in only one additional (superfluous) value being added to the combination existing low qty combination.

I have looked around and this sounds like it might be similar to a packing problem and in particular the knapsack problem but I can't seem to wrap my head around an approach for this in python. I am not looking for a solution just some guidance on a starting point or perhaps libraries that may be helpful.

Thank you.

_________________________________________________________________

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/tutor/attachments/20100211/fb5c9462/attachment.htm>