removing duplication from a huge list.

Shanmuga Rajan m.shanmugarajan at gmail.com
Thu Feb 26 15:37:33 EST 2009


Hi


I have a list of records with some details (more than 15 million records),
and it contains duplicates.

I need to iterate through every record and eliminate the duplicate
records.

Currently I am using a script like this:


counted_recs = []

# some_fun() returns a generator; the generator is the source of the
# records, because I don't want to carry all 15 million records in a
# list at once (that would need more than 1 GB of memory).
x = some_fun()

for rec in x:
    if rec[0] not in counted_recs:
        # some logic goes here...
        counted_recs.append(rec[0])   # I only need rec[0] (the name) from the record
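
(For context, some_fun just yields one record at a time. The real function is
not shown here; the sketch below is only a placeholder with a made-up file
name and field layout, to show the shape of what it does.)

def some_fun():
    # Placeholder sketch: read records lazily, one per line, so the whole
    # 15-million-record data set never sits in memory at once.
    # "records.txt" and the tab-separated layout are made up for illustration.
    with open("records.txt") as f:
        for line in f:
            yield line.rstrip("\n").split("\t")   # rec[0] is the name field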


But I am sure this is not an optimized way to do it,
so I came up with a different solution. I am not confident in that
solution either.

Here is my second solution:

counted_recs = []

x = some_fun()

#x = [ rec[0] for rec in x]

for rec in x:
    if counted_recs.count(rec[0]) == 0:   # i.e. rec[0] has not been seen yet
        # some logic goes here
        counted_recs.append(rec[0])
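
I also wondered whether using a set for the membership test would help, since
both versions above scan a list for every record. This is only a rough,
untested sketch of that idea (the "some logic" part is still just a
placeholder):

counted_recs = set()          # set membership tests are O(1) on average

x = some_fun()

for rec in x:
    if rec[0] not in counted_recs:
        # some logic goes here...
        counted_recs.add(rec[0])   # remember the name so later duplicates are skipped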


Which approach is better? If anyone can suggest a better solution, I will be
very happy.

Thanks in advance for any help.


Shan
