Orders of magnitude

Robert Brewer fumanchu at amor.org
Tue Mar 30 09:29:34 CEST 2004

Christian Tismer wrote:
> >> Robert Brewer wrote:
> >>
> >>> I'm dedup'ing a 10-million-record dataset, trying different approaches
> >>> for building indexes. The in-memory dicts are clearly faster, but I get
> >>> Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any
> > But I don't see how your solution avoids the problem, however (some code
> > snipped):
> > 
> >     def process_duplicates(self):
> >         for i in range(len(self.files)):
> >             seen = {}
> >             has = seen.has_key
> >             for k in range(self.flushes):
> >                 bin = pickle.load(self.files[i])
> >                 for key, value in bin:
> >                     if has(key):
> >                         self.duplicates += 1
> >                     else:
> >                         seen[key] = value
> Just try it, it works!
> [discussion of hashing snipped]
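
For anyone following along, the flush-to-disk scheme quoted above can be sketched roughly like this (a minimal, self-contained guess at the surrounding machinery -- bucket count, file handling, and the `dedupe_external` name are my own, not from the original code):

```python
import pickle
import tempfile

def dedupe_external(records, num_buckets=4):
    """Dedupe (key, value) pairs without holding one big dict in memory.

    Pass 1 distributes records into bucket files by hash of the key;
    pass 2 dedupes each bucket independently, so only one bucket's
    'seen' dict is in memory at a time.
    """
    files = [tempfile.TemporaryFile() for _ in range(num_buckets)]
    bins = [[] for _ in range(num_buckets)]

    # Pass 1: hash each key into a bucket, then pickle each bucket to disk.
    for key, value in records:
        bins[hash(key) % num_buckets].append((key, value))
    for f, bin_ in zip(files, bins):
        pickle.dump(bin_, f)

    # Pass 2: load one bucket at a time and dedupe it with a small dict.
    unique, duplicates = [], 0
    for f in files:
        f.seek(0)
        seen = {}
        for key, value in pickle.load(f):
            if key in seen:
                duplicates += 1
            else:
                seen[key] = value
        unique.extend(seen.items())
        f.close()
    return unique, duplicates
```

Keys that collide always land in the same bucket, so no duplicate can slip through by ending up in two different files.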

Thanks again. Really, thanks!

However, I think you're letting the arrow fly before you've seen the
target. I understand hashing and making a bucket for each possible first
character (I should have read your code more thoroughly). That is a fine
technique, and it was already on my short list of workarounds. However,
your _complete_ solution isn't feasible for my dataset, because I'm
comparing some fields within each record, not entire records. For
example, with rows like:

"Michael","Kemperson","26002","me at aol.com",""
"Ryan","Thomas","46723","myfriend at yahoo.com",""
"Vicki","Thomas","46723","myfriend at yahoo.com",""
"Shanuana","Hedlund","10415","andi at yahoo.com",""

...I need to keep Ryan and dump Vicki, based on shared email and IP, but
*not* on the other fields. If Vicki had the same email but a different
IP, don't drop that record, but blank out the 2nd phone number. In
addition, most but not all files have phone numbers, which get the same
treatment. That's why I said I have three dicts/indexes.
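
As a rough sketch of those two rules (field layout guessed from the sample rows above -- first, last, IP, email, phone -- and the function name is mine, not from my actual code):

```python
def dedupe_rows(rows):
    """Keep the first row per (email, IP) pair.

    A later row with the same email AND IP is dropped entirely;
    a later row with the same email but a different IP is kept,
    with its phone field blanked out.
    """
    seen_pairs = set()    # (email, ip) combinations already kept
    seen_emails = set()   # emails already kept under some IP
    kept = []
    for first, last, ip, email, phone in rows:
        if (email, ip) in seen_pairs:
            continue                  # exact email+IP duplicate: drop
        if email in seen_emails:
            phone = ""                # same email, new IP: blank the phone
        seen_pairs.add((email, ip))
        seen_emails.add(email)
        kept.append([first, last, ip, email, phone])
    return kept
```

Run on the four sample rows, this keeps Michael, Ryan, and Shanuana and drops Vicki, since her email and IP both match Ryan's.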

No, don't rewrite it! :D I have even *more* requirements, which are
mostly in my code but not really documented, and some are changing. I'm
working on a more scalable solution, since I then need to apply the same
code to a 20-million-record DB. I'm doing fine--just wondered in
particular about ways to optimize that single step. Thanks for your
help!

Robert Brewer
Amor Ministries
fumanchu at amor.org
