extracting duplicates from CSV file by specific fields

Rhodri James rhodri at wildebst.demon.co.uk
Tue Apr 28 21:14:21 EDT 2009


On Wed, 29 Apr 2009 01:53:24 +0100, VP <vadim.pestovnikov at gmail.com> wrote:

> Hi,
> I have a csv file:
>
> 'aaa.111', 'T100', 'pn123', 'sn111'
> 'aaa.111', 'T200', 'pn123', 'sn222'
> 'bbb.333', 'T300', 'pn123', 'sn333'
> 'ccc.444', 'T400', 'pn123', 'sn444'
> 'ddd', 'T500', 'pn123', 'sn555'
> 'eee.666', 'T600', 'pn123', 'sn444'
> 'fff.777', 'T700', 'pn123', 'sn777'
>
> How can I extract duplicates checking each row by field1 and field4?


Untested:

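import csv

seen_in_field0 = set()
seen_in_field3 = set()

# Python 3: open in text mode with newline="" (on Python 2, use "rb").
with open("myfile.csv", newline="") as f:
    for row in csv.reader(f):
        # A row is a duplicate if either key field has appeared before.
        if row[0] in seen_in_field0 or row[3] in seen_in_field3:
            reject_this(row)  # placeholder: handle the duplicate row
        else:
            seen_in_field0.add(row[0])
            seen_in_field3.add(row[3])
            accept_this(row)  # placeholder: handle the first occurrence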


This assumes that you don't record fields 0 and 3 for lines that
are rejected, i.e. if the file is:

'aaa.111', 'T100', 'pn123', 'sn111'
'aaa.111', 'T200', 'pn123', 'sn222'
'aaa.222', 'T300', 'pn123', 'sn222'

you want to keep:

'aaa.111', 'T100', 'pn123', 'sn111'
'aaa.222', 'T300', 'pn123', 'sn222'
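
For what it's worth, here is a slightly fuller sketch of the same idea that runs on the three-line example above. The names split_duplicates, read_rows and record_rejected are made up for illustration; they are not from the original question. It also assumes the file really does wrap fields in single quotes with a space after each comma, which is why the reader is given quotechar="'" and skipinitialspace=True. Setting record_rejected=True shows the other possible behaviour, where the fields of rejected lines are remembered as well.

```python
import csv
import io

# Sample data copied from the three-line example in the message.
SAMPLE = """\
'aaa.111', 'T100', 'pn123', 'sn111'
'aaa.111', 'T200', 'pn123', 'sn222'
'aaa.222', 'T300', 'pn123', 'sn222'
"""


def read_rows(text):
    """Parse text as CSV, assuming single-quoted fields and a space after each comma."""
    reader = csv.reader(io.StringIO(text), quotechar="'", skipinitialspace=True)
    return [row for row in reader if row]  # skip blank lines


def split_duplicates(rows, record_rejected=False):
    """Return (accepted, rejected), treating fields 0 and 3 as the keys.

    With record_rejected=False (the behaviour described in the message),
    the fields of a rejected row are not remembered.
    """
    seen_in_field0 = set()
    seen_in_field3 = set()
    accepted, rejected = [], []
    for row in rows:
        is_duplicate = row[0] in seen_in_field0 or row[3] in seen_in_field3
        if is_duplicate:
            rejected.append(row)
        else:
            accepted.append(row)
        if not is_duplicate or record_rejected:
            seen_in_field0.add(row[0])
            seen_in_field3.add(row[3])
    return accepted, rejected


rows = read_rows(SAMPLE)
accepted, rejected = split_duplicates(rows)
print("kept:", accepted)
print("rejected:", rejected)
```

With the default setting the T100 and T300 lines are kept, as in the example above. With record_rejected=True only the T100 line survives, because 'sn222' is remembered from the rejected T200 line.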

-- 
Rhodri James *-* Wildebeeste Herder to the Masses


