[Tutor] Re: removal of duplicates from .csv files
Fri, 26 Jan 2001 11:57:39 -0500
> I have been given several comma-delimited (.csv) files, each containing
> as many as several thousand lines of entries. Among the tasks I've been
> charged with is to remove duplicate entries. The files each contain fields
> for Contact Name, Company Name, Phone Number, and Address, among other
> fields, which vary from file to file.
One approach you may want to consider is to create a dictionary keyed on
the phone number and/or address.
Read the file in a line at a time and split() out the field you are
looking for. Normalize that string to remove as much ambiguity as possible
(for example, strip the spaces and punctuation from a phone number). Then
you can use the has_key() method to check whether the key is already in the
dictionary. If it isn't, create a dictionary item using the string as the
key for future comparisons and write the line to a "non-duplicate" file. If
it is, write the line to a "duplicate" file. Using this method, you could
first check whether the record has a phone number and compare based on
that. If it doesn't, you could then fall back to the less accurate address
check.
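Here is a rough sketch of the above, written for Python 3 rather than the
Python of this thread, so take it as one possible shape and not the only
way to do it. Two things are swapped in: the csv module replaces a bare
split(), since split(",") would break on quoted fields that contain commas,
and the "in" operator replaces has_key(), which no longer exists in
Python 3. The function names, the normalization rules, and the phone_col
and address_col column indexes are all made up for illustration; they will
differ from file to file.

```python
import csv
import re


def normalize_phone(raw):
    """Keep digits only, so '(555) 123-4567' and '555.123.4567' match."""
    return re.sub(r"\D", "", raw)


def normalize_address(raw):
    """Lowercase and turn runs of punctuation/whitespace into one space.

    This is a crude cleanup (an assumption, not a real address parser):
    '1 Main St.' and '1 Main Street' will still look different.
    """
    return " ".join(re.sub(r"[^a-z0-9]+", " ", raw.lower()).split())


def field(row, index):
    """Return row[index], or '' if the row is too short."""
    return row[index] if index < len(row) else ""


def record_key(row, phone_col, address_col):
    """Prefer the phone number as the key; fall back to the address."""
    phone = normalize_phone(field(row, phone_col))
    if phone:
        return ("phone", phone)
    address = normalize_address(field(row, address_col))
    if address:
        return ("address", address)
    return None  # nothing usable to compare on


def split_duplicates(infile, unique_out, dup_out, phone_col, address_col):
    """Copy first-seen rows to unique_out and repeats to dup_out.

    All three file arguments are already-open text file objects.
    Returns (number of unique rows, number of duplicate rows).
    """
    seen = {}  # key -> first row that produced it
    unique_writer = csv.writer(unique_out)
    dup_writer = csv.writer(dup_out)
    n_unique = n_dup = 0
    for row in csv.reader(infile):
        key = record_key(row, phone_col, address_col)
        if key is not None and key in seen:
            dup_writer.writerow(row)
            n_dup += 1
        else:
            if key is not None:
                seen[key] = row
            # Rows with no usable key are kept, since they can't be compared.
            unique_writer.writerow(row)
            n_unique += 1
    return n_unique, n_dup
```

When using real files, open them with newline="" as the csv module
documentation recommends, e.g. open("contacts.csv", newline=""). A header
row, if present, is treated like any other row here and will normally end
up in the "non-duplicate" output.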
Hope this helps,