Match items in large list
MRAB
google at mrabarnett.plus.com
Thu Feb 12 11:37:10 EST 2009
Fisherking wrote:
> Hi,
>
> I hope you can help me with this optimizing problem!
> I have a large list of dictionaries (about 250 000 of them). Two or
> more of these dictionaries contains the same data (not more than five
> or six of them per item) e.g. [{'a':1,'b':'2','c':3} , {'d':
> 4,'e':'5','f':6},{'a':1,'b':'2','c':3} , {'d':4,'e':'5','f':6},...].
> (the dictionaries are really containg phone numbers, duration time and
> call time.)
>
> Which are the best way of searching through the list and extract the
> items that are the same. In the first run I want to find every item
> that are the same as {'a':1,'b':'2','c':3}, the second {'d':
> 4,'e':'5','f':6} etc. The list are not read-only and I can pop them
> out of the list when I have found them!
>
> (How can I found items that are almost the same? e.g. {'a':
> 1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2} )
>
> In the second part of this program I need to take each of these items
> (similar items are treated as one) and find these in a database that
> contains 2.5M posts!
>
> The python version I am bound to is Python 2.3
>
The best way of looking for duplicates is to make a dict or set of the
items. Unfortunately a dict isn't hashable, so it can't be a key.
Therefore I suggest you convert the dicts into tuples, which are
hashable:
counts = {}
for d in my_list:
d_as_list = d.items()
# Standardise the order so that identical dicts become identical
tuples.
d_as_list.sort()
key = tuple(d_as_list)
if key in counts:
counts[key] += 1
else:
counts[key] = 1
More information about the Python-list
mailing list