Orders of magnitude

Tue Mar 30 10:37:31 EST 2004

Robert Brewer wrote:

...

>>[discussion of hashing snipped]
> 
> 
> Thanks again. Really, thanks!
> 
> However, I think you're letting the arrow fly before you've seen the
> target.

Oops :-)
This was a late-night session, and then I tend to shoot
early :-)

> I understand hashing and making a bucket for each possible first
> character (I should have read your code more thoroughly). That is a fine
> technique, and it was already on my short list of workarounds. However,
> your _complete_ solution isn't feasible for my dataset, because I'm
> comparing some fields within each record, not entire records.

Sure, that's why I have a file-like object as a parameter,
and you are supposed to do your preprocessing and create
the string that should represent the interesting part of
a record. I didn't plan to code that for you, but
expected you would want to provide your own function and
use the rest at your convenience.
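
To make that concrete, here is a minimal sketch of such a
preprocessing function. It is not the code from my earlier post;
the names record_key and unique_rows and the column positions 3
and 4 (email and IP) are assumptions taken from your sample rows.

```python
import csv

def record_key(row, fields=(3, 4)):
    """Return the string that represents the interesting part of a record.

    `fields` holds the column indexes to compare; 3 and 4 (email, IP)
    are assumed from the sample rows, not known from the real data.
    """
    # Join with NUL so that ("ab", "c") and ("a", "bc") cannot collide.
    return "\x00".join(row[i].strip().lower() for i in fields)

def unique_rows(lines, key=record_key):
    """Yield the first row seen for each distinct key, in input order."""
    seen = set()
    for row in csv.reader(lines):
        k = key(row)
        if k not in seen:
            seen.add(k)
            yield row
```

Any iterable of lines will do for `lines`, an open file included.
Only the keys are kept in memory, not the whole records.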

> For
> example, with rows like:
> 
> "Michael","Kemperson","26002","me at aol.com","64.12.104.18"
> "Ryan","Thomas","46723","myfriend at yahoo.com","209.244.21.23"
> "Vicki","Thomas","46723","myfriend at yahoo.com","209.244.21.23"
> "Shanuana","Hedlund","10415","andi at yahoo.com","122.150.184.175"
> 
> ...I need to keep Ryan and dump Vicki, based on shared email and IP, but
> *not* on the other fields. If Vicki had the same email but a different
> IP, don't drop that record, but blank out the 2nd phone number. In
> addition, most but not all files have phone numbers, which get the same
> treatment. That's why I said I have three dicts/indexes.

That's bad, of course, since you were asking how to find duplicates
in database records. To me, that means a single processing step is
possible, one that calculates the unique information from the
current record only.

If you actually need grouping and decisions based upon that,
the approach doesn't work, since you need to hold all the information
to further process it.
In this case, an external sorting process might make sense,
with the record fields rearranged so that email and IP go first.
Then you can probably do a sequential pass over the data.
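
A rough sketch of that sort-then-scan idea, under some assumptions:
the in-memory sorted() call only stands in for a real external sort,
the column positions EMAIL and IP are guessed from your sample rows,
and the per-group decision is reduced to "keep the first record".
Your blanking rules would go where the group is consumed.

```python
import csv
from itertools import groupby

EMAIL, IP = 3, 4  # assumed column positions, taken from the sample rows

def email_ip(row):
    """Sort and grouping key: email first, then IP."""
    return (row[EMAIL].lower(), row[IP])

def dedupe_sorted(rows):
    """Sequential pass over rows already sorted by email_ip.

    Records with equal (email, IP) are adjacent, so only one group
    needs to be looked at at a time.
    """
    for _, group in groupby(rows, key=email_ip):
        # Keep the first record of each group and drop the rest.
        yield next(group)

def dedupe(lines):
    # Stand-in for an external sort. sorted() is stable, so within a
    # group the record that came first in the input stays first.
    rows = sorted(csv.reader(lines), key=email_ip)
    return list(dedupe_sorted(rows))
```

The output comes back in key order, not in input order. If the
original order matters, a sequence number would have to be carried
along in each record and sorted on afterwards.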

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at stackless.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship* http://starship.python.net/
14109 Berlin                 :     PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34  home +49 30 802 86 56  mobile +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/