Orders of magnitude

Robert Brewer fumanchu at amor.org
Mon Mar 29 08:47:00 CEST 2004


x = [chr(a) + chr(b) for a in xrange(100) for b in xrange(100)]

# list version: c = []
for i in x:
    if i not in c:
        c.append(i)
>>> t1.timeit(1)
2.0331145438875637

# dict version: c = {}
for i in x:
    if i not in c:
        c[i] = None
>>> t2.timeit(1)
0.0067952770534134288

# bsddb version: c = bsddb.btopen(None)
for i in x:
    if i not in c:
        c[i] = None
>>> t3.timeit(1)
0.18430750276922936

Wow. Dicts are *fast*.

I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?


Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org




More information about the Python-list mailing list