removing duplication from a huge list.
Shanmuga Rajan
m.shanmugarajan at gmail.com
Thu Feb 26 15:37:33 EST 2009
Hi
I have a list of records (more than 15 million of them) that contains duplicates.
I need to iterate through every record and eliminate the duplicates.
Currently I am using a script like this:
counted_recs = []
# some_fun() returns a generator; it is the source of the records.
# I don't want to carry all 15 million records in a list, because that
# would need more than 1 GB of memory.
x = some_fun()
for rec in x:
    if rec[0] not in counted_recs:
        # some logic goes here...
        counted_recs.append(rec[0])  # I only need rec[0] (the name) from each record
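Just for context, some_fun() is along these lines. This is only a rough sketch to show the shape of the generator; the file name and the tab-separated field layout below are placeholders, not my real code:

def some_fun():
    # Yield one record at a time from the source (here a flat text file),
    # so the whole 15 million records never sit in memory at once.
    with open("records.txt") as f:  # placeholder file name
        for line in f:
            # rec[0] is the name, the remaining fields are the other details
            yield line.rstrip("\n").split("\t")

Iterating over this keeps memory use roughly constant, which is why I prefer the generator over building one big list.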
But I am sure this is not an optimized way to do it, so I came up with a different solution. I am not confident in that one either. Here is my second solution:
counted_recs = []
x = some_fun()
# x = [rec[0] for rec in x]
for rec in x:
    if counted_recs.count(rec[0]) == 0:
        # some logic goes here
        counted_recs.append(rec[0])
Which one is better? If anyone can suggest a better solution, I will be very happy.
Thanks in advance for any help.
Shan