[Tutor] List Dup-Elim Method?

Tim Peters tim.one@home.com
Sat, 3 Feb 2001 04:43:14 -0500


If the elements of a list are hashable, the method using a dict will be by
far the fastest if the list is large.

If the elements are not hashable (for example, a list of lists), but can be
sorted, still much quicker than brute force (yet still much slower than
using a temporary dict!) is this:

def dupfree(x):
    """Remove all duplicates from list x, in place.

    The elements of x must enjoy a total ordering.
    """

    n = len(x)
    if n > 1:
        x.sort()
        last = x[0]
        i = avail = 1
        while i < n:
            if x[i] != last:
                x[avail] = last = x[i]
                avail += 1
            i += 1
        del x[avail:]

That requires Python 2.0, because of the "+=" thingies.

Example:

>>> x = [[1, 2], [3, 4]]
>>> x *= 10
>>> x
[[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4],
 [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4],
 [1, 2], [3, 4], [1, 2], [3, 4]]
>>> dupfree(x)
>>> x
[[1, 2], [3, 4]]
>>>

How does it work?  Sorting the list brings all the equal elements next to
each other.  (Note:  Many sorting routines don't work very quickly when
there are lots of equal elements, but Python's does.  I know that because I
wrote Python's list.sort() <wink>.)

After all the equal elements are adjacent, it just marches across the list
once, keeping track of when the list element at index i *changes*.  When it
does, it moves that not-seen-before element toward the front of the list,
and moves on skipping over the chunk of elements equal to *it*.  Finally, it
gets rid of the list positions no longer needed (the "del" stmt).

Of course there's nothing to do if the list doesn't have at least two
elements to start, so it checks for that first.  It *has* to avoid going
into the main body if the list is empty, because then "last = x[0]" would
blow up.  Since it has to check for that anyway, it doesn't cost any more to
make sure there are at least two things in the list.

This isn't an easy algorithm -- the steps are delicate.  If you can use the
dict method instead, do so!

what-you-can-assume-constrains-what-you-can-do-ly y'rs  - tim