Match items in large list

Terry Reedy tjreedy at udel.edu
Thu Feb 12 12:36:20 EST 2009


Fisherking wrote:
> Hi,
> 
> I hope you can help me with this optimizing problem!
> I have a large list of dictionaries (about 250 000 of them). Two or
> more of these dictionaries contains the same data (not more than five
> or six of them per item) e.g. [{'a':1,'b':'2','c':3} , {'d':
> 4,'e':'5','f':6},{'a':1,'b':'2','c':3} , {'d':4,'e':'5','f':6},...].
> (the dictionaries are really containg phone numbers, duration time and
> call time.)
> 
> Which are the best way of searching through the list and extract the
> items that are the same. In the first run I want to find every item
> that are the same as {'a':1,'b':'2','c':3}, the second {'d':
> 4,'e':'5','f':6} etc. The list are not read-only and I can pop them
> out of the list when I have found them!

Popping items out of the middle of a list is a slow O(n) operation.
MRAB's set of tuples is what I would have suggested.

> 
> (How can I found items that are almost the same? e.g. {'a':
> 1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2} )
> 
> In the second part of this program I need to take each of these items
> (similar items are treated as one) and find these in a database that
> contains 2.5M posts!
> 
> The python version I am bound to is Python 2.3

For that, dict of tuples (with fake value == None) might be faster than 
the add-on set module.




More information about the Python-list mailing list