Orders of magnitude

Robert Brewer fumanchu at amor.org
Mon Mar 29 18:21:57 EST 2004


Christian Tismer wrote:
> Robert Brewer wrote:
> > I'm dedup'ing a 10-million-record dataset, trying different approaches
> > for building indexes. The in-memory dicts are clearly faster, but I get
> > Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
> > other ways to build a large index without slowing down by a factor of 25?
> 
> So, here we go.
> The attached script processes a gigabyte of data, one
> million records, in about a minute on my machine, and
> finds the single duplicate.

Thanks, Christian! Looks like someone's been having fun. ;)

But I don't see how your solution avoids the problem (some code
snipped):

    def process_duplicates(self):
        for i in range(len(self.files)):
            seen = {}
            has = seen.has_key
            for k in range(self.flushes):
                # each flush is a pickled batch of (key, value) pairs
                bin = pickle.load(self.files[i])
                for key, value in bin:
                    if has(key):
                        self.duplicates += 1
                    else:
                        seen[key] = value

...that is, the 'seen' dictionary is what grows too large on my
dataset; it ends up holding about 7 million items. Actually, I have
three such dictionaries (three different fields that I check for dupes
in each record). The hashing might cut those sizes down a bit, but I
think I'm still going to end up having to cache 'seen' (which I call
'indices' in my Directory.produce() method) somehow.
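
One way around that might be to partition by key up front: on a first
pass, hash each key into one of N bucket files, so every occurrence of
a given key lands in the same bucket and each bucket's 'seen' dict
stays small enough for RAM. A rough sketch (not your code; 'key_of',
the bucket count, and the file names are made up, and it only handles
a single field):

    import pickle

    NBUCKETS = 64  # tune so one bucket's keys fit comfortably in RAM

    def partition(records, key_of, prefix='bucket'):
        """Pass 1: scatter (key, value) pairs into buckets by hash(key)."""
        files = [open('%s%02d.pkl' % (prefix, i), 'wb')
                 for i in range(NBUCKETS)]
        for value in records:
            key = key_of(value)
            # hash() is stable within a single run, which is all we need
            pickle.dump((key, value), files[hash(key) % NBUCKETS], 2)
        for f in files:
            f.close()
        return ['%s%02d.pkl' % (prefix, i) for i in range(NBUCKETS)]

    def count_duplicates(filenames):
        """Pass 2: dedupe each bucket with a small in-memory dict."""
        duplicates = 0
        for name in filenames:
            seen = {}
            f = open(name, 'rb')
            while True:
                try:
                    key, value = pickle.load(f)
                except EOFError:
                    break
                if key in seen:
                    duplicates += 1
                else:
                    seen[key] = value
            f.close()
        return duplicates

For three fields you'd either run the two passes once per field or tag
each dumped pair with a field id. The other obvious way to cache 'seen'
is a disk-backed mapping (anydbm or shelve), which is simpler but costs
a lot of small disk hits.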

Current ugly code is at http://www.aminus.org/rbre/python/dedupe.py


Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org



