Match items in large list
Peter Otten
__peter__ at web.de
Thu Feb 12 07:09:05 EST 2009
Fisherking wrote:
> I hope you can help me with this optimizing problem!
> I have a large list of dictionaries (about 250 000 of them). Two or
> more of these dictionaries contains the same data (not more than five
> or six of them per item) e.g. [{'a':1,'b':'2','c':3} , {'d':
> 4,'e':'5','f':6},{'a':1,'b':'2','c':3} , {'d':4,'e':'5','f':6},...].
> (the dictionaries are really containg phone numbers, duration time and
> call time.)
I'd assume the dictionaries to look like
{"phone": "0123...", "duration": 5.67, "calltime": "10:42"}
What do the different keys ("a", "b", "c") versus ("d", "e", "f") mean?
> Which are the best way of searching through the list and extract the
> items that are the same. In the first run I want to find every item
> that are the same as {'a':1,'b':'2','c':3}, the second {'d':
> 4,'e':'5','f':6} etc. The list are not read-only and I can pop them
> out of the list when I have found them!
items = [dict(a=1, b=2), dict(a=1, b=2), dict(a=3, c=2), dict(a=3, c=2.2),
dict(a=3, c=2.6)]
# 2.5
seen = set()
for row in items:
rowkey = frozenset(row.iteritems())
if rowkey in seen:
print "dupe", row
else:
seen.add(rowkey)
# 2.3 (and older); not tested
seen = {}
for row in items:
rowkey = row.items()
rowkey.sort()
rowkey = tuple(rowkey)
if rowkey in seen:
print "dupe", row
else:
seen[rowkey] = 1
> (How can I found items that are almost the same? e.g. {'a':
> 1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2} )
If you are content with detecting similarities on a single "axis":
def window(items):
items = iter(items)
prev = items.next()
for cur in items:
yield cur, prev
prev = cur
def todict(k, v, rest):
d = dict(rest)
d[k] = v
return d
axis = "c"
d = {}
eps = 0.5
for row in items:
row = dict(row)
try:
value = row.pop(axis)
except KeyError:
pass
else:
key = frozenset(row.iteritems())
d.setdefault(key, []).append(value)
for key, values in d.iteritems():
if len(values) > 1:
for cur, prev in window(sorted(values)):
if abs(cur-prev) < eps:
print "similar rows", todict(axis, cur, key), todict(axis,
prev, key)
Otherwise you have to define a distance function and use a clustering
algorithm. Segaran's "Programming Collective Intelligence" has some
examples in Python, though the code can hardly be called idiomatic.
Peter
More information about the Python-list
mailing list