removing duplicates from .csv files

Fredrik Aronsson d98aron at dtek.chalmers.se
Mon Jan 29 12:22:51 EST 2001


In article <94ptre$uf0$1 at nnrp1.deja.com>,
	mspiggie at my-deja.com writes:
> I have been given several comma-delimited (.csv) files, each containing
> as many as several thousand lines of entries.  Among the tasks I've
> been charged with is to remove duplicate entries.  The files each
> contain fields for Contact Name, Company Name, Phone Number, and
> Address, among other fields, which vary from file to file.
> 
> I'm trying to determine a good way to sort for duplicates according to
> Phone Number and according to Address.  It seems that sorting by Phone
> Number would be simpler due to minor differences in the way data entry
> clerks might have input the addresses (W, W., and West, for instance),
> but not all entries have phone numbers.

Good idea.

> 
> I have already come up with some code to work around most of the
> obvious problems with the different ways addresses may be input, but
> I'm not sure what the best way to sort for duplicates might be.  One
> suggestion I have received is to have lines with identical fields
> placed back to back with an equivalence check and manually read through
> the file.  The equivalence check itself seems simple, but I'm not sure
> how to scan only the target field (split(), maybe?), and I certainly
> want to avoid having to manually remove the duplicates afterward.
> 

Well, use the module suggested by Dave to read the stuff.
Then you can extract the phone number and some other important
field and put them into tuples like this:

(phone, address (or last name or ...), ..., record number in the source file)

Sorting a list of these tuples orders them by phone number first;
records with the same phone number (or no phone number) are then
sorted by the second field, and so on.
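For instance (an untested sketch: the standard csv module, the
file name, and the column positions are all assumptions -- adjust
them to your files):

import csv

# Columns holding the phone number and address -- placeholders.
PHONE_COL = 2
ADDRESS_COL = 3

keys = []
with open('contacts.csv', newline='') as f:
    for recno, row in enumerate(csv.reader(f)):
        # (phone, address, ..., record number in the source file)
        keys.append((row[PHONE_COL], row[ADDRESS_COL], recno))

# Tuples compare element by element, so this sorts by phone number
# first and falls back to the address on ties.
keys.sort()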

Then you can scan through the sorted list and find groups of
records that are identical according to your search criterion.
(A faster approach would probably be to build a binary tree.)
If these groups are small enough, you can run your equivalence
check on all combinations of records within a group (a sketch
follows the listing), e.g. compare record
1 with 2
1 with 3
...
1 with n

2 with 3
2 with 4
...
2 with n

.
.
.

n-1 with n
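In Python, itertools can do both the grouping and the pairwise
comparisons. A sketch (the sample data and the records_equal()
check are placeholders; 'keys' is the sorted tuple list from the
earlier sketch):

from itertools import groupby, combinations

keys = [('5551234', '12 W MAIN ST', 0),
        ('5551234', '12 W MAIN ST', 7),
        ('5559876', '4 OAK AVE', 3)]

def records_equal(a, b):
    # Stand-in for your own equivalence check.
    return a[:2] == b[:2]

duplicates = set()
for _, group in groupby(keys, key=lambda t: t[0]):  # group on phone
    group = list(group)
    # compare 1 with 2, 1 with 3, ..., n-1 with n
    for a, b in combinations(group, 2):
        if records_equal(a, b):
            duplicates.add(b[-1])  # keep the first, mark the second

print(duplicates)  # {7}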

If two records are identical, copy the record number from the
right tuple to a "delete later" list.
(You only want to delete one of each pair of identical records...)

After that, delete all the records in the file whose record
numbers are in the "delete later" list.
(If one record = one line, that part is easy.)
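For the deletion, the simplest thing is to rewrite the file and
skip the marked records. A sketch (the file names are
placeholders; 'duplicates' is the set built above):

import csv

duplicates = {7}  # record numbers marked in the previous sketch

with open('contacts.csv', newline='') as src, \
     open('contacts_deduped.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for recno, row in enumerate(csv.reader(src)):
        if recno not in duplicates:
            writer.writerow(row)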

This is untested and depends on the sort working well enough
that you don't get very large groups.
If you can, normalize the fields and use the normalized forms as
your search criteria. E.g. if you sort XXX-XXXX and XXX XX XX,
they will not end up beside each other; that's why you have to
transform them to a common form (XXXXXXX or something) before
sorting.
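A regular expression that keeps only the digits is one way to get
that common form (normalize_phone() is just a name I made up):

import re

def normalize_phone(raw):
    # Strip everything that is not a digit.
    return re.sub(r'\D', '', raw)

print(normalize_phone('555-1234'))   # 5551234
print(normalize_phone('555 12 34'))  # 5551234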

Hope this helps a little
Fredrik


