removing duplication from a huge list.

Paul Rubin http
Fri Feb 27 23:18:36 CET 2009

Tim Rowe <digitig at> writes:
> We were told in the original question: more than 15 million records,
> and it won't all fit into memory. So your observation is pertinent.

That is not terribly many records by today's standards.  The knee-jerk
approach is to sort them externally, then make a linear pass skipping
the duplicates.  Is the exercise to write an external sort in Python?
It's worth doing if you've never done it before.
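
For anyone who wants to try it, here is a minimal sketch of that approach, not a tuned implementation. It assumes each record is a string with no embedded newline, and that losing the original order is acceptable (the output comes back sorted). The names unique_records, _write_sorted_run, _read_run and the chunk_size parameter are made up for illustration; chunk_size is how many records you are willing to hold in memory at once.

```python
import heapq
import itertools
import os
import tempfile


def _write_sorted_run(records):
    """Sort one in-memory chunk and write it to a temp file, one record per line."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.writelines(r + "\n" for r in sorted(records))
    return path


def _read_run(path):
    """Lazily yield the records of one sorted run."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def unique_records(records, chunk_size=1_000_000):
    """Yield each distinct record once, in sorted order.

    Assumption: records are newline-free strings; chunk_size records fit in memory.
    """
    it = iter(records)
    paths = []
    try:
        # Phase 1: split the input into sorted runs on disk.
        while True:
            chunk = list(itertools.islice(it, chunk_size))
            if not chunk:
                break
            paths.append(_write_sorted_run(chunk))
        # Phase 2: k-way merge of the runs; only one record per run is in memory.
        merged = heapq.merge(*(_read_run(p) for p in paths))
        # Phase 3: linear pass; groupby collapses each run of equal neighbours.
        for key, _group in itertools.groupby(merged):
            yield key
    finally:
        for p in paths:
            os.remove(p)
```

Typical use would be to pass an open file (stripping the newline from each line) and write the results straight out, e.g. `for rec in unique_records(line.rstrip("\n") for line in infile): outfile.write(rec + "\n")`. The merge keeps one file open per run, so with 15 million records and the default chunk_size that is only about 15 files.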
