[Tutor] removal of duplicates from .csv files

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Thu, 25 Jan 2001 22:57:00 -0800 (PST)


On Thu, 25 Jan 2001, Rob Andrews wrote:

> I have been given several comma-delimited (.csv) files, each containing as
> many as several thousand lines of entries.  Among the tasks I've been
> charged with is to remove duplicate entries.  The files each contain fields
> for Contact Name, Company Name, Phone Number, and Address, among other
> fields, which vary from file to file.
> 
> I'm trying to determine a good way to sort for duplicates according to Phone
> Number and according to Address.  It seems that sorting by Phone Number
> would be simpler due to minor differences in the way data entry clerks might
> have input the addresses (W, W., and West, for instance), but not all
> entries have phone numbers.

Hello!  sort() takes in an optional "comparison" function that tells
Python how to order objects.  For example, let's say we have a list of
strings:

    L = ['this', 'is', 'a', 'test']

Let's say that we want to sort this list by last letter.  We could write
the following comparison function:

    def lastLetterCmp(x, y):
        return cmp(x[-1], y[-1])

and sort L with it:

    L.sort(lastLetterCmp)
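The same idea carries over to your CSV rows: write a comparison function
that splits each line and compares only the phone-number field.  Here's a
rough sketch (the phone number being the third field is an assumption --
adjust the index for your files; on current Pythons you'd say
rows.sort(phoneCmp) directly, while much newer versions route the function
through functools.cmp_to_key):

```python
from functools import cmp_to_key  # adapts cmp-style functions for newer sorts

def phoneCmp(x, y):
    # Compare two comma-delimited rows by their phone-number field.
    # (Field index 2 is an assumption; change it to match your files.)
    px, py = x.split(',')[2], y.split(',')[2]
    return (px > py) - (px < py)   # same result cmp(px, py) would give

rows = ['Rob,Acme,555-1212,12 W. Main St',
        'Ann,Beta,333-0000,34 Oak Ave']
rows.sort(key=cmp_to_key(phoneCmp))
# rows is now ordered by phone number, so duplicate numbers end up adjacent
```

Once the rows are sorted this way, any rows sharing a phone number sit
next to each other, which makes a single pass over the list enough to
spot them.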


> equivalence check and manually read through the file.  The equivalence check
> itself seems simple, but I'm not sure how to scan only the target field
> (split(), maybe?), and I certainly want to avoid having to manually remove
> the duplicates afterward.

You'll want your home-grown comparison function to check only the
fields that you're interested in sorting on.
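And since you want to avoid weeding out the duplicates by hand, you can
let a dictionary do the bookkeeping: record each phone number the first
time you see it, and skip any later row with the same number.  A rough
sketch (again, the phone number sitting in the third field is an
assumption, and rows with no phone number are kept, since you said not
all entries have one):

```python
PHONE_FIELD = 2   # assumed position of the phone number; adjust as needed

def removeDuplicates(lines):
    seen = {}     # phone numbers we've already kept
    result = []
    for line in lines:
        phone = line.split(',')[PHONE_FIELD].strip()
        if phone and phone in seen:
            continue          # duplicate phone number -- drop this row
        if phone:
            seen[phone] = 1   # remember it for later rows
        result.append(line)   # rows with empty phone numbers pass through
    return result

rows = ['Rob,Acme,555-1212,12 W. Main St',
        'Robert,Acme,555-1212,12 West Main St',
        'Ann,Beta,,34 Oak Ave']
for row in removeDuplicates(rows):
    print(row)
```

The dictionary lookup also sidesteps the address problem you mentioned:
"12 W. Main St" and "12 West Main St" never get compared at all, because
the rows are matched purely on phone number.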