[Tutor] Find duplicates (using dictionaries)

Wed Feb 17 17:31:42 CET 2010

I'm relatively new at Python and I'm trying to write a function that fills a dictionary acording the following rules and (example) data:

Rules:
* No duplicate values in field1
* No duplicates values in field2 and field3 simultaneous (highest value in field4 has to be preserved)

Rec.no field1, field2, field3, field4
1. abc, def123, ghi123, 120 <-- new, insert in dictionary
2. abc, def123, ghi123, 120 <-- duplicate with 1. field4 same value. Do not insert in dictionary
3. bcd, def123, jkl125, 154 <-- new, insert in dictionary
4. efg, def123, jkl125, 175 <-- duplicate with 3 in field 2 and 3, but higher value in field4. Remove 3. from dict and replace with 4.
5. hij, ghi345, jkl125, 175 <-- duplicate field3, but not in field4. New, insert in dict.

The resulting dictionary should be:

hij     {'F2': ' ghi345', 'F3': ' jkl125', 'F4': 175}
abc     {'F2': ' def123', 'F3': ' ghi123', 'F4': 120}
efg     {'F2': ' def123', 'F3': ' jkl125', 'F4': 175}

This is wat I came up with up to now, but there is something wrong with it. The 'bcd' should have been removed. When I run it it says:

bcd     {'F2': ' def123', 'F3': ' jkl125', 'F4': 154}
hij     {'F2': ' ghi345', 'F3': ' jkl125', 'F4': 175}
abc     {'F2': ' def123', 'F3': ' ghi123', 'F4': 120}
efg     {'F2': ' def123', 'F3': ' jkl125', 'F4': 175}

Below is wat I brew (simplified). It took me some time to figure out that I was looking at the wrong values the wrong dictionary. I started again, but am ending up with a lot of dictionaries and for x in y-loops. I think there is a simpler way to do this.

Can somebody point me in the right direction and explain to me how to do this? (and maybe have an alternative for the nesting. Because I may need to compare more fields. This is only a simplified dataset).

######### not working 
def createResults(field1, field2, field3, field4):
        #check if field1 exists.
                if not results.has_key(field1):

                        if results.has_key(field2):
                                #check if field2 already exists

                                if results.has_key(field3):
                                    #check if field3 already exists
                                    #retrieve value field4
                                    existing_field4 = results[field2][F4]
                                    #retrieve value existing field1 in dict
                                    existing_field1 = results[field1]

                                    #perform highest value check
                                    if int(existing_field4) < int(field4):
                                        #remove existing record from dict.
                                        del results[existing_field1]
                                        values = {}
                                        values['F2'] = field2
                                        values['F3'] = field3
                                        values['F4'] = field4
                                        results[field1] = values
                                    else:
                                        pass
                                else:
                                    pass
                        else:
                            values = {}
                            values['F2'] = field2
                            values['F3'] = field3
                            values['F4'] = field4
                            results[field1] = values
                else:
                    pass

for line in open("file.csv"):
        field1, field2, field3, field4 = line.split(',')
        createResults(field1, field2, field3, int(field4))
    #because this is quick and dirty I had to get rid of the \n in the csv

for i in results.keys():
        print i, '\t', results[i] 
################

contents file.csv

abc, def123, ghi123, 120
abc, def123, ghi123, 120
bcd, def123, jkl125, 154
efg, def123, jkl125, 175
hij, ghi345, jkl125, 175

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/tutor/attachments/20100217/b8297446/attachment-0001.htm>