[Tutor] finding duplicates within a tuple of tuples

Thu Jul 29 19:05:17 CEST 2010

Norman Khine wrote:

> hello,
> 
> i have this tuple:
> 
> http://paste.lisp.org/+2F4X
> 
> i have this, which does what i want:
> 
> from collections import defaultdict
> 
> d = defaultdict(set)
> for id, url in result:
>     d[url].add(id)
> for url in sorted(d):
>     if len(d[url]) > 1:
>         print('%d -- %s' % (len(d[url]), url))
> 
> so here the code checks for duplicate urls and counts the number of
> occurrences.
> 
> but i am sort of stuck in that i want to now update the id of the
> related table and update the
> 
> basically i have two tables:
> 
> id, url
> 24715L, 'http://aqoon.local/muesli/2-muesli-tropical-500g.html'
> 24719L, 'http://aqoon.local/muesli/2-muesli-tropical-500g.html'
> 
> id, tid,
> 1, 24715L
> 2, 24719L
> 
> so i want to first update t(2)'s tid to t(1)'s id for each duplicate
> and then delete the row id = 24719L

You can use another dictionary that maps ids associated with the same url to 
a canonical id.

from collections import defaultdict

url_table = [
    (24715, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24719, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24720, "http://example.com/index.html"),
]

id_table = [
    (1, 24715),
    (2, 24719),
    (3, 24720),
]

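# Group the ids by url; a url with more than one id is a duplicate.
dupes = defaultdict(set)
for uid, url in url_table:
    dupes[url].add(uid)

# Map every id of a duplicated url to one canonical id (the smallest).
lookup = {}
for synonyms in dupes.values():
    if len(synonyms) > 1:
        canonical = min(synonyms)
        for alias in synonyms:
            assert alias not in lookup
            lookup[alias] = canonical

# Rewrite the references; ids without duplicates stay unchanged.
ids = [(row_id, lookup.get(uid, uid)) for row_id, uid in id_table]
print(ids)
# Keep a single (canonical id, url) row per url.
urls = [(min(synonyms), url) for url, synonyms in dupes.items()]
print(urls)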
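
If the two tables live in a database, the same lookup can drive the UPDATE and 
DELETE statements you describe. What follows is only a sketch: it uses an 
in-memory SQLite database through the standard sqlite3 module, and the table 
and column names (url_table, id_table, id, url, tid) as well as the function 
name merge_duplicates are assumptions based on the layout in your mail, not 
your real schema.

```python
import sqlite3
from collections import defaultdict

def merge_duplicates(conn):
    """Point id_table.tid at one canonical id per url, then delete the
    redundant url_table rows. Table and column names are assumptions."""
    dupes = defaultdict(set)
    for uid, url in conn.execute("SELECT id, url FROM url_table"):
        dupes[url].add(uid)

    # Map every non-canonical id to the smallest id sharing its url.
    lookup = {}
    for synonyms in dupes.values():
        canonical = min(synonyms)
        for alias in synonyms:
            if alias != canonical:
                lookup[alias] = canonical

    for alias, canonical in lookup.items():
        # First repoint the references, then drop the duplicate row.
        conn.execute("UPDATE id_table SET tid = ? WHERE tid = ?",
                     (canonical, alias))
        conn.execute("DELETE FROM url_table WHERE id = ?", (alias,))
    conn.commit()
    return lookup

# Hypothetical schema and sample rows mirroring the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url_table (id INTEGER PRIMARY KEY, url TEXT)")
conn.execute("CREATE TABLE id_table (id INTEGER PRIMARY KEY, tid INTEGER)")
conn.executemany("INSERT INTO url_table VALUES (?, ?)", [
    (24715, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24719, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24720, "http://example.com/index.html"),
])
conn.executemany("INSERT INTO id_table VALUES (?, ?)",
                 [(1, 24715), (2, 24719), (3, 24720)])

lookup = merge_duplicates(conn)
remaining_urls = conn.execute(
    "SELECT id, url FROM url_table ORDER BY id").fetchall()
remaining_ids = conn.execute(
    "SELECT id, tid FROM id_table ORDER BY id").fetchall()
print(remaining_urls)
print(remaining_ids)
```

With a different database module the placeholder style may differ (for 
example %s instead of ?), so treat the SQL strings as illustrative.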

Note that if you use a database for these tables you can avoid creating 
duplicates in the first place.
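
One way to do that is a UNIQUE constraint on the url column, so that a second 
insert of the same url does not create a new row. A minimal sketch with 
sqlite3 follows; the schema and the helper name get_or_create_url_id are made 
up for illustration, and INSERT OR IGNORE is SQLite syntax that other 
databases spell differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE makes the database itself refuse a second row with the same url.
conn.execute(
    "CREATE TABLE url_table ("
    "id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE)"
)

def get_or_create_url_id(conn, url):
    """Return the id for url, inserting a row only if the url is new.
    Hypothetical helper, not part of any library."""
    conn.execute("INSERT OR IGNORE INTO url_table (url) VALUES (?)", (url,))
    row = conn.execute(
        "SELECT id FROM url_table WHERE url = ?", (url,)).fetchone()
    return row[0]

muesli = "http://aqoon.local/muesli/2-muesli-tropical-500g.html"
first = get_or_create_url_id(conn, muesli)
second = get_or_create_url_id(conn, muesli)   # same url, same id
other = get_or_create_url_id(conn, "http://example.com/index.html")
row_count = conn.execute("SELECT COUNT(*) FROM url_table").fetchone()[0]
print(first == second, row_count)
```

The referencing table then always stores the id returned by the helper, so 
there is nothing to merge afterwards.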

Peter