removing duplicates from .csv files

Greg Jorgensen gregj at pobox.com
Thu Jan 25 14:28:35 EST 2001


I suggest Excel, unless you have to do this a lot. What I've done when
faced with this problem is import the file(s) into an Excel worksheet,
clean up the columns you care about (remove leading/trailing spaces,
normalize upper/lower case), and fix phone numbers, zip codes, etc. so
the punctuation is normalized (e.g. 800-555-1212 vs 800 555 1212). Next
sort the whole thing on one of the columns that will identify
duplicates--phone number is a good start. Add a column that will flag
duplicates; in the second non-blank row of the new column enter:

  =IF(X2=X1,"*","")

where X is the letter of the column you sorted on. If you have blank
phone numbers, the first non-blank rows will be later than rows 1 and
2, but you get the idea. Copy the formula down to the end of the data.
Now you'll see a *
in any row where the phone number is the same as the previous row. You
can merge/correct the duplicate rows, then delete the ones you don't
want. Do this with all of the columns that may identify duplicates
until the data is clean.
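The same sort-and-compare-adjacent trick is easy to sketch in Python.
Here is a minimal example (the names and sample numbers are my own, not
from any library) that normalizes phone numbers and builds the same
flag column the worksheet formula produces:

```python
import re

# sample phone numbers with inconsistent punctuation (made up for illustration)
phones = ["800-555-1212", "503 555 0000", "800 555 1212"]

def digits(s):
    # keep only digits so punctuation differences don't matter
    return re.sub(r"\D", "", s)

# normalize, then sort so duplicates become adjacent
normalized = sorted(digits(p) for p in phones)

# flag each entry that equals the previous one, like the worksheet column
flags = [""] + ["*" if a == b else "" for a, b in zip(normalized, normalized[1:])]
```

After this, `flags` has a "*" wherever a value repeats its predecessor,
which is exactly what the =IF formula shows in the spreadsheet.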

If you will be doing this often, it becomes an interesting and
non-trivial programming problem in Python (or any language). There are
commercial programs that de-duplicate and merge name/address data, and
some of them are quite expensive.
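For a repeatable version, a first cut in Python might read the CSV,
normalize the key column, and keep only the first row for each
normalized key. A sketch, with hypothetical names and sample data of
my own:

```python
import csv
import io
import re

def normalize_phone(raw):
    """Strip non-digits so 800-555-1212 and 800 555 1212 compare equal."""
    return re.sub(r"\D", "", raw)

def dedupe(rows, key_field):
    """Keep the first row seen for each normalized key; drop later duplicates."""
    seen = set()
    kept = []
    for row in rows:
        key = normalize_phone(row[key_field])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# two rows whose phone numbers differ only in punctuation
data = "name,phone\nAlice,800-555-1212\nAl,800 555 1212\nBob,503-555-0000\n"
rows = list(csv.DictReader(io.StringIO(data)))
clean = dedupe(rows, "phone")
# clean keeps Alice and Bob; the punctuation-variant duplicate is dropped
```

A real tool would need the same per-column cleanup described above
(case, whitespace, zip codes) and a merge step instead of a blind drop,
which is where the non-trivial part comes in.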

--
Greg Jorgensen
Portland, Oregon, USA
gregj at pobox.com





More information about the Python-list mailing list