Match items in large list

Thu Feb 12 11:37:10 EST 2009

Fisherking wrote:
> Hi,
> 
> I hope you can help me with this optimizing problem!
> I have a large list of dictionaries (about 250 000 of them). Two or
> more of these dictionaries contains the same data (not more than five
> or six of them per item) e.g. [{'a':1,'b':'2','c':3} , {'d':
> 4,'e':'5','f':6},{'a':1,'b':'2','c':3} , {'d':4,'e':'5','f':6},...].
> (the dictionaries are really containg phone numbers, duration time and
> call time.)
> 
> Which are the best way of searching through the list and extract the
> items that are the same. In the first run I want to find every item
> that are the same as {'a':1,'b':'2','c':3}, the second {'d':
> 4,'e':'5','f':6} etc. The list are not read-only and I can pop them
> out of the list when I have found them!
> 
> (How can I found items that are almost the same? e.g. {'a':
> 1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2} )
> 
> In the second part of this program I need to take each of these items
> (similar items are treated as one) and find these in a database that
> contains 2.5M posts!
> 
> The python version I am bound to is Python 2.3
> 
The best way of looking for duplicates is to make a dict or set of the
items. Unfortunately a dict isn't hashable, so it can't be a key.
Therefore I suggest you convert the dicts into tuples, which are
hashable:

counts = {}
for d in my_list:
     d_as_list = d.items()
     # Standardise the order so that identical dicts become identical 
tuples.
     d_as_list.sort()
     key = tuple(d_as_list)
     if key in counts:
         counts[key] += 1
     else:
         counts[key] = 1