removing duplication from a huge list.

Falcolas garrickp at gmail.com
Fri Feb 27 11:55:50 EST 2009


On Feb 27, 8:33 am, Tim Rowe <digi... at gmail.com> wrote:
> 2009/2/27 odeits <ode... at gmail.com>:
>
> > How big of a list are we talking about? If the list is so big that the
> > entire list cannot fit in memory at the same time, this approach won't
> > work, e.g. removing duplicate lines from a very large file.
>
> We were told in the original question: more than 15 million records,
> and it won't all fit into memory. So your observation is pertinent.
>
> --
> Tim Rowe

If order did matter, and the list itself couldn't be stored in memory,
I would personally compute some sort of hash of each item (or something
as simple as the first 5 bytes, the last 5 bytes, and the length),
keeping a reference back to the item each hash came from; then sort the
hashes to identify duplicates, and use the references to check whether
the actual items in question match as well.

Pretty brutish and slow, but it's the first algorithm that comes to
mind. Of course, I'm assuming that the list items are long enough to
warrant hashing them rather than comparing the values themselves.
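Something along these lines, as a rough, untested sketch: it assumes
the records live one per line in a file on disk, uses an md5 digest
(via hashlib) as the fingerprint, and substitutes a dict of
fingerprints for the explicit sort. Only the 16-byte digests and file
offsets are held in memory, never the records themselves. The function
name and signature are made up for illustration.

import hashlib

def find_duplicate_lines(path):
    # Pass 1: map fingerprint -> file offsets of the lines that
    # produced it. Only digests and offsets are kept in memory.
    seen = {}
    with open(path, 'rb') as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            digest = hashlib.md5(line).digest()
            seen.setdefault(digest, []).append(offset)

    # Pass 2: a fingerprint seen more than once is only a *candidate*
    # duplicate, so re-read those lines and compare the actual bytes
    # to rule out hash collisions. Report a bucket only when every
    # line in it matches the first one.
    duplicates = []
    with open(path, 'rb') as f:
        for offsets in seen.values():
            if len(offsets) < 2:
                continue
            lines = []
            for off in offsets:
                f.seek(off)
                lines.append(f.readline())
            if all(line == lines[0] for line in lines[1:]):
                duplicates.append(offsets)
    return duplicates

For 15 million records that's roughly a 16-byte digest plus an offset
per line, which should fit comfortably in memory even when the records
themselves don't.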

~G


