R: unique items in lists

Bob Cannard bob_cannard at mentor.com
Tue Mar 27 13:46:40 EST 2001


Remco Gerlich wrote:
> 
> comp.lang.python <micampe at f2s.com> wrote in comp.lang.python:
> > Here is what I'm doing now:
> >
> >         for w in words:
> >             while words.count(w) > 1:
> >                 words.remove(w)
> >
> > comments?
> 
> First, you modify a list that you are looping over. That can give
> unpredictable results, but in this case it seems to work.

The list [1,1,2,2,3,3,2,1] breaks this algorithm: it produces
[3,3,2,1]. It fails because the algorithm preserves the last
occurrence of each value, not the first, so the 3's are moving to
the left while the for loop's internal pointer is moving to the
right; the two cross each other without ever meeting. Heed that
warning in the language reference; iterating over a changing list
can seriously damage your mental health, not to mention your
algorithm!
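
To see it happen, run the quoted loop on exactly that list:

        words = [1, 1, 2, 2, 3, 3, 2, 1]
        for w in words:
            while words.count(w) > 1:
                words.remove(w)
        print(words)    # prints [3, 3, 2, 1] - the duplicate 3 survives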

> It will take a long time. 'words.remove(w)' takes time proportional to the
> length of the list; words.count(w) takes time proportional to the length of
> the list itself, and you repeat that for every duplicate of a word, and the
> total is repeated for every unique word. In total it takes time proportional
> to N^3 (for a list of length N), I think.

That's right. It's a horrific run time for any reasonably large N:
a billion operations for N = 1000.

> The dictionary method does one dictionary action for each word, and then it
> does a dict.keys() which takes time proportional to each unique word.

I don't know how dictionaries are implemented, but hopefully they're
a lot faster than that: one would expect a hash or tree
implementation, giving constant or logarithmic run time for each
lookup. A linear implementation would be an obvious cause of poor
run-time performance for Python generally.

> In total
> it takes time proportional to N.

More than that in general; it would be order N only for a list such
as [1,1,1,1,1,...]. I would estimate something between N and
N log N, depending on the input data and the implementation of
dictionaries. In any case, it's vastly better than the previous
method: only several thousand operations for N = 1000. In the worst
case (a linear implementation of dictionaries plus pathological
data) the run time is still only N^2, or a million operations for
N = 1000.
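
For reference, a minimal sketch of that dictionary method might look
like this (the function name "unique" is only for illustration, and
the order of the result is not guaranteed):

        def unique(words):
            seen = {}
            for w in words:
                seen[w] = 1                 # one dictionary store per element
            return list(seen.keys())        # one key per distinct value

        print(unique([1, 1, 2, 2, 3, 3, 2, 1]))    # e.g. [1, 2, 3]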

Cheers,

        Bob.


