Match items in large list

Thu Feb 12 07:09:05 EST 2009

Fisherking wrote:

> I hope you can help me with this optimizing problem!
> I have a large list of dictionaries (about 250 000 of them). Two or
> more of these dictionaries contains the same data (not more than five
> or six of them per item) e.g. [{'a':1,'b':'2','c':3} , {'d':
> 4,'e':'5','f':6},{'a':1,'b':'2','c':3} , {'d':4,'e':'5','f':6},...].
> (the dictionaries are really containg phone numbers, duration time and
> call time.)

I'd assume the dictionaries to look like

{"phone": "0123...", "duration": 5.67, "calltime": "10:42"}

What do the different keys ("a", "b", "c") versus ("d", "e", "f") mean?

> Which are the best way of searching through the list and extract the
> items that are the same. In the first run I want to find every item
> that are the same as {'a':1,'b':'2','c':3}, the second {'d':
> 4,'e':'5','f':6} etc. The list are not read-only and I can pop them
> out of the list when I have found them!

items = [dict(a=1, b=2), dict(a=1, b=2), dict(a=3, c=2), dict(a=3, c=2.2),
dict(a=3, c=2.6)]

# 2.5
seen = set()
for row in items:
    rowkey = frozenset(row.iteritems())
    if rowkey in seen:
        print "dupe", row
    else:
        seen.add(rowkey)

# 2.3 (and older); not tested
seen = {}
for row in items:
    rowkey = row.items()
    rowkey.sort()
    rowkey = tuple(rowkey)
    if rowkey in seen:
        print "dupe", row
    else:
        seen[rowkey] = 1

> (How can I found items that are almost the same? e.g. {'a':
> 1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2} )

If you are content with detecting similarities on a single "axis":

def window(items):
    items = iter(items)
    prev = items.next()
    for cur in items:
        yield cur, prev
        prev = cur

def todict(k, v, rest):
    d = dict(rest)
    d[k] = v
    return d

axis = "c"
d = {}
eps = 0.5
for row in items:
    row = dict(row)
    try:
        value = row.pop(axis)
    except KeyError:
        pass
    else:
        key = frozenset(row.iteritems())
        d.setdefault(key, []).append(value)

for key, values in d.iteritems():
    if len(values) > 1:
        for cur, prev in window(sorted(values)):
            if abs(cur-prev) < eps:
                print "similar rows", todict(axis, cur, key), todict(axis,
prev, key)

Otherwise you have to define a distance function and use a clustering
algorithm. Segaran's "Programming Collective Intelligence" has some
examples in Python, though the code can hardly be called idiomatic.

Peter