removing duplication from a huge list.

odeits odeits at gmail.com
Fri Feb 27 01:17:58 EST 2009


On Feb 26, 9:15 pm, Chris Rebert <c... at rebertia.com> wrote:
> On Thu, Feb 26, 2009 at 8:49 PM, Benjamin Peterson <benja... at python.org> wrote:
> > Shanmuga Rajan <m.shanmugarajan <at> gmail.com> writes:
>
> >> If anyone suggests a better solution then I will be very happy. Advance
> >> thanks for any help. Shan
>
> > Use a set.
>
> To expand on that a bit:
>
> counted_recs = set(rec[0] for rec in some_fun())
> #or in Python 3.0:
> counted_recs = {rec[0] for rec in some_fun()}
>
> Cheers,
> Chris
>
> --
> Follow the path of the Iguana... http://rebertia.com

How big a list are we talking about? If the list is so big that it
cannot fit in memory all at once, this approach won't work, e.g. when
removing duplicate lines from a very large file. A couple of things
come into play at that point. If order does not matter, I would
suggest looking at some recipes for first sorting the large file and
then iterating through the lines, removing duplicates as you go
( if cur_line != last_line: write cur_line; last_line = cur_line );
a fuller sketch of that pass follows below.
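
As a minimal sketch of that second pass (assuming the input has
already been sorted externally, e.g. by one of those recipes or the
Unix sort command; the function name and file paths are illustrative,
not from the original post):

def dedupe_sorted(in_path, out_path):
    # In a sorted file duplicate lines are adjacent, so remembering
    # only the previous line is enough to drop all repeats.
    with open(in_path) as src, open(out_path, 'w') as dst:
        last_line = None
        for cur_line in src:
            if cur_line != last_line:
                dst.write(cur_line)
                last_line = cur_line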

If order does matter, let me know and I will post a recipe for
that.
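
In case it is useful in the meantime, here is a rough sketch of one
common order-preserving approach (not necessarily the recipe alluded
to above): remember a fixed-size digest of each line seen so far.
The digests take far less memory than long lines, though the set of
digests itself must still fit in RAM. The function name and paths
are illustrative:

import hashlib

def dedupe_keep_order(in_path, out_path):
    seen = set()
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        for line in src:
            # A 16-byte MD5 digest stands in for the full line, so
            # the set stays small even when lines are long.
            digest = hashlib.md5(line).digest()
            if digest not in seen:
                seen.add(digest)
                dst.write(line)

If even the digest set is too large for memory, the usual fallback is
to decorate each line with its position, sort externally, drop
adjacent duplicates, and re-sort by position to restore the original
order.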


