[Tutor] Re: removal of duplicates from .csv files

Steve slyskawa@yahoo.com
Fri, 26 Jan 2001 11:57:39 -0500

> I have been given several comma-delimited (.csv) files, each containing
> as many as several thousand lines of entries.  Among the tasks I've been
> charged with is to remove duplicate entries.  The files each contain
> fields for Contact Name, Company Name, Phone Number, and Address, among other
> fields, which vary from file to file.

One approach you may want to consider is to create a dictionary with the
phone number and/or address as a key.

Read in a line at a time and split() out the field you are looking for.
Format the string to remove as much ambiguity as possible (for a phone
number, for example, strip everything but the digits).  Then you can use
the has_key() method (in Python 3, where has_key() no longer exists, the
"in" operator) to check whether that key is already in the dictionary.
If it isn't, create a dictionary item using the string as the key for
future compares and output the line to a "non-duplicate" file.  If it is,
output the line to a "duplicate" file.  Using this method, you could
first check if the record has a phone number and compare based on that.
If it doesn't, you could then fall back to the less accurate address
check.
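
Here is a rough sketch of that idea in current Python, in case it helps.
It is only an illustration under some assumptions: the function names
(normalize_phone, normalize_address, split_duplicates), the column
positions, and the sample rows are all made up, and it uses the standard
csv module rather than a bare split(), since split(",") would break on
quoted fields that contain commas.  The normalization rules are crude
guesses; real data will likely need more care.

```python
import csv
import io
import re

def normalize_phone(raw):
    # Keep digits only, so "(555) 123-4567" and "555.123.4567" compare equal.
    return re.sub(r"\D", "", raw)

def normalize_address(raw):
    # Lowercase, drop punctuation, collapse runs of whitespace.
    # A rough heuristic: "St." and "Street" still count as different.
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", raw.lower()).split())

def split_duplicates(rows, phone_col, address_col):
    """Return (unique, duplicates), keyed on phone, else on address."""
    seen = {}
    unique, duplicates = [], []
    for row in rows:
        phone = normalize_phone(row[phone_col])
        address = normalize_address(row[address_col])
        if phone:
            key = ("phone", phone)
        elif address:
            key = ("address", address)
        else:
            # Nothing to compare on, so keep the row rather than guess.
            unique.append(row)
            continue
        if key in seen:          # the modern spelling of seen.has_key(key)
            duplicates.append(row)
        else:
            seen[key] = True
            unique.append(row)
    return unique, duplicates

# Hypothetical sample data: name, company, phone, address.
sample = io.StringIO(
    "Ann Lee,Acme,(555) 123-4567,1 Main St.\n"
    "A. Lee,Acme Inc,555.123.4567,1 Main Street\n"
    "Bob Ray,Bolt,,22 Oak Ave\n"
    "Robert Ray,Bolt,,22  oak ave.\n"
)
unique, duplicates = split_duplicates(csv.reader(sample),
                                      phone_col=2, address_col=3)
```

For real files you would pass csv.reader(open(...)) instead of the sample
and write the two lists out with csv.writer to the "non-duplicate" and
"duplicate" files.  Note that a record with a phone number is never
compared by address here, so the same contact entered once with a phone
and once without will not be caught.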

Hope this helps,
