Orders of magnitude
Robert Brewer
fumanchu at amor.org
Mon Mar 29 18:21:57 EST 2004
Christian Tismer wrote:
> Robert Brewer wrote:
> > I'm dedup'ing a 10-million-record dataset, trying different approaches
> > for building indexes. The in-memory dicts are clearly faster, but I get
> > Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
> > other ways to build a large index without slowing down by a factor of
> > 25?
>
> So, here we go.
> The attached script processes a gigabyte of data, one
> million records, in about a minute on my machine, and
> finds the single duplicate.
Thanks, Christian! Looks like someone's been having fun. ;)
But I don't see how your solution avoids the problem (some code
snipped):
def process_duplicates(self):
    for i in range(len(self.files)):
        seen = {}
        has = seen.has_key
        for k in range(self.flushes):
            bin = pickle.load(self.files[i])
            for key, value in bin:
                if has(key):
                    self.duplicates += 1
                else:
                    seen[key] = value
...that is, the 'seen' dictionary is what is growing too large on my
dataset; it ends up being about 7 million items. Actually, I have three
such dictionaries (3 different fields which I'm checking for dupes for
each record). The hashing might cut these sizes down a bit, but I think
I'm still going to end up having to cache 'seen' (which I call 'indices'
in my Directory.produce() method) somehow.
Current ugly code is at http://www.aminus.org/rbre/python/dedupe.py
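One way the 'seen' dict could be kept small is to partition the keys by
hash into N bucket files on disk first, then dedupe each bucket with its
own in-memory set; since a given key always lands in the same bucket,
each pass only needs about 1/N of the index in memory at once. This is a
hypothetical sketch (the names are mine, not from dedupe.py):

```python
import hashlib
import os
import tempfile

def count_duplicates(keys, num_buckets=64):
    """Two-pass external dedup: scatter keys into bucket files by
    hash, then dedupe each bucket with a small in-memory set."""
    tmpdir = tempfile.mkdtemp()
    buckets = [open(os.path.join(tmpdir, "bucket%d" % i), "w+")
               for i in range(num_buckets)]
    # Pass 1: write each key to the bucket file chosen by its hash.
    # Equal keys always hash to the same bucket, so no duplicate
    # pair is ever split across two files.
    for key in keys:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        buckets[h % num_buckets].write(key + "\n")
    # Pass 2: each bucket holds roughly 1/num_buckets of the data,
    # so its 'seen' set fits in memory even when one big index
    # for all 10 million records would not.
    duplicates = 0
    for f in buckets:
        f.seek(0)
        seen = set()
        for line in f:
            key = line.rstrip("\n")
            if key in seen:
                duplicates += 1
            else:
                seen.add(key)
        f.close()
    return duplicates
```

With three fields to check per record, the same scatter pass could write
to three bucket sets at once, so the data is still only read through
twice.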
Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org