removing duplication from a huge list.
python.list at tim.thechases.com
Fri Feb 27 18:30:45 CET 2009
>> How big of a list are we talking about? If the list is so big that the
>> entire list cannot fit in memory at the same time, this approach won't
>> work, e.g. removing duplicate lines from a very large file.
> We were told in the original question: more than 15 million records,
> and it won't all fit into memory. So your observation is pertinent.
Assuming the working set of unique items will still fit within
memory, it can be done with the following generator, regardless of
the total size of the input:
  def deduplicator(iterable):
      seen = set()
      for item in iterable:
          if item not in seen:
              seen.add(item)
              yield item

  s = [7,6,5,4,3,6,9,5,4,3,2,5,4,3,2,1]
  print list(deduplicator(s))

  for line in deduplicator(file('huge_test.txt')):
      print line,
It maintains order, emitting only new items as they're encountered.
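For readers on current Python 3, where the built-in file() is gone, a minimal self-contained sketch of the same generator, run against the sample list from the post, looks like this:

```python
def deduplicator(iterable):
    # Yield each item the first time it appears, skipping later repeats.
    # Memory use grows with the number of *unique* items, not the input size.
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

s = [7, 6, 5, 4, 3, 6, 9, 5, 4, 3, 2, 5, 4, 3, 2, 1]
print(list(deduplicator(s)))  # [7, 6, 5, 4, 3, 9, 2, 1]
```

The same generator deduplicates the lines of a file by passing it an open file object, e.g. deduplicator(open('huge_test.txt')).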