[Tutor] finding duplicates within a tuple of tuples
Peter Otten
__peter__ at web.de
Thu Jul 29 19:05:17 CEST 2010
Norman Khine wrote:
> hello,
>
> i have this tuple:
>
> http://paste.lisp.org/+2F4X
>
> i have this, which does what i want:
>
> from collections import defaultdict
>
> d = defaultdict(set)
> for id, url in result:
>     d[url].add(id)
> for url in sorted(d):
>     if len(d[url]) > 1:
>         print('%d -- %s' % (len(d[url]), url))
>
> so here the code checks for duplicate urls and counts the number of
> occurrences.
>
> but i am sort of stuck in that i want to now update the id of the
> related table and update the
>
> basically i have two tables:
>
> id, url
> 24715L, 'http://aqoon.local/muesli/2-muesli-tropical-500g.html'
> 24719L, 'http://aqoon.local/muesli/2-muesli-tropical-500g.html'
>
> id, tid,
> 1, 24715L
> 2, 24719L
>
> so i want to first update t(2)'s tid to t(1)'s id for each duplicate
> and then delete the row id = 24719L
You can use another dictionary that maps ids associated with the same url to
a canonical id.

from collections import defaultdict

url_table = [
    (24715, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24719, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24720, "http://example.com/index.html"),
]

id_table = [
    (1, 24715),
    (2, 24719),
    (3, 24720),
]

dupes = defaultdict(set)
for uid, url in url_table:
    dupes[url].add(uid)

lookup = {}
for synonyms in dupes.itervalues():
    if len(synonyms) > 1:
        canonical = min(synonyms)
        for alias in synonyms:
            assert alias not in lookup
            lookup[alias] = canonical

ids = [(id, lookup.get(uid, uid)) for id, uid in id_table]
print ids
urls = [(min(synonyms), url) for url, synonyms in dupes.iteritems()]
print urls

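For the second half of the question (repoint the tid first, then delete the redundant row), the same lookup also tells you which ids can go. Here is a minimal sketch of that, written for Python 3, so it uses .values() and .items() rather than itervalues() and iteritems(). The function name merge_duplicates and the to_delete list are names I made up for illustration, and picking the smallest id as the canonical one is just a convention.

```python
from collections import defaultdict


def merge_duplicates(url_table, id_table):
    """Return (ids, urls, to_delete) with duplicate urls collapsed."""
    dupes = defaultdict(set)
    for uid, url in url_table:
        dupes[url].add(uid)

    # Map every id to the smallest id that shares its url.
    lookup = {}
    for synonyms in dupes.values():
        canonical = min(synonyms)
        for alias in synonyms:
            lookup[alias] = canonical

    # Repoint the referencing table at the canonical ids.
    ids = [(rid, lookup.get(uid, uid)) for rid, uid in id_table]
    # One row per url, keeping the canonical id.
    urls = sorted((min(synonyms), url) for url, synonyms in dupes.items())
    # Every id that is not its own canonical id is now unreferenced.
    to_delete = sorted(alias for alias, canon in lookup.items() if alias != canon)
    return ids, urls, to_delete


url_table = [
    (24715, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24719, "http://aqoon.local/muesli/2-muesli-tropical-500g.html"),
    (24720, "http://example.com/index.html"),
]
id_table = [(1, 24715), (2, 24719), (3, 24720)]

ids, urls, to_delete = merge_duplicates(url_table, id_table)
print(ids)        # [(1, 24715), (2, 24715), (3, 24720)]
print(to_delete)  # [24719]
```

Doing the update before the delete matters: once ids has been written back, nothing refers to the ids in to_delete any more, so removing those rows cannot leave a dangling tid.
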
Note that if you use a database for these tables you can avoid creating
duplicates in the first place.
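
As a rough illustration of what I mean, here is a sketch using the sqlite3 module from the standard library. I don't know which database you are actually using, so treat SQLite, the table name urls, its columns, and the helper get_or_create as assumptions; the idea is only that a UNIQUE constraint on the url column makes the database refuse a second row for the same url.

```python
import sqlite3

# In-memory database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE)"
)


def get_or_create(conn, url):
    """Return the id for url, inserting a row only if the url is new."""
    # INSERT OR IGNORE skips the insert when the UNIQUE constraint
    # on url would be violated, instead of raising an error.
    conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    row = conn.execute("SELECT id FROM urls WHERE url = ?", (url,)).fetchone()
    return row[0]


first = get_or_create(conn, "http://aqoon.local/muesli/2-muesli-tropical-500g.html")
second = get_or_create(conn, "http://aqoon.local/muesli/2-muesli-tropical-500g.html")
other = get_or_create(conn, "http://example.com/index.html")

count = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
print(first == second, count)  # True 2
```

With the constraint in place the second table can always store the id returned by get_or_create, and there is nothing to clean up afterwards.
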
Peter