[Tutor] removal of duplicates from .csv files
Thu, 25 Jan 2001 22:57:00 -0800 (PST)
On Thu, 25 Jan 2001, Rob Andrews wrote:
> I have been given several comma-delimited (.csv) files, each containing as
> many as several thousand lines of entries. Among the tasks I've been
> charged with is to remove duplicate entries. The files each contain fields
> for Contact Name, Company Name, Phone Number, and Address, among other
> fields, which vary from file to file.
> I'm trying to determine a good way to sort for duplicates according to Phone
> Number and according to Address. It seems that sorting by Phone Number
> would be simpler due to minor differences in the way data entry clerks might
> have input the addresses (W, W., and West, for instance), but not all
> entries have phone numbers.
Hello! sort() takes an optional "comparison" function that tells
Python how to order objects. For example, say we have the list:
L = ['this', 'is', 'a', 'test']
Let's say that we want to sort this list by last letter. We could write
the following comparison function:
def lastLetterCmp(x, y):
    return cmp(x[-1], y[-1])

and sort L with it:

L.sort(lastLetterCmp)
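Applied to your .csv lines, the same idea lets you sort on just the
phone-number field. Here's a sketch -- the sample rows are made up, and I'm
guessing the phone number is the third comma-separated field, so adjust the
index for your files. (If you're on a Python without cmp-style sorting, the
equivalent is a key function, as shown here.)

```python
# Hypothetical sample rows: Contact Name, Company Name, Phone Number, Address
rows = [
    "Jane Doe,Acme Inc,555-0103,12 W Main St",
    "Bob Roe,Widgets LLC,555-0101,4 East Ave",
    "Ann Poe,Acme Inc,555-0102,9 North Rd",
]

def phone_field(line):
    # split() the line on commas and pull out the field we care about;
    # index 2 is an assumption -- change it to match your files
    return line.split(",")[2]

# sort the rows by phone number only, ignoring the other fields
rows.sort(key=phone_field)
```

Once the rows are sorted this way, duplicates by phone number end up next
to each other, which makes them easy to spot or remove in a single pass.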
> equivalence check and manually read through the file. The equivalence check
> itself seems simple, but I'm not sure how to scan only the target field
> (split(), maybe?), and I certainly want to avoid having to manually remove
> the duplicates afterward.
You'll want your home-grown comparison function to check only the
fields that you're interested in sorting on.
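And to drop the duplicates automatically rather than removing them by hand,
one approach is a dictionary keyed on the field you're comparing: keep the
first row you see for each phone number and skip the rest. A sketch, again
with made-up rows and the phone number assumed to be the third field:

```python
# Hypothetical rows, including one pair that shares a phone number
rows = [
    "Jane Doe,Acme Inc,555-0103,12 W Main St",
    "J. Doe,Acme Incorporated,555-0103,12 West Main Street",
    "Bob Roe,Widgets LLC,555-0101,4 East Ave",
]

seen = {}        # phone numbers we've already encountered
unique_rows = []  # first row seen for each phone number

for line in rows:
    phone = line.split(",")[2]   # field index is an assumption
    if phone not in seen:
        seen[phone] = 1
        unique_rows.append(line)
```

The same loop works for any field -- key the dictionary on the address
instead, for instance -- though as you noted, addresses will need some
normalization (W vs. W. vs. West) before exact matching is reliable.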