[Tutor] a question about list

Fri Nov 12 04:19:12 CET 2004

Max Noel wrote:
> 
> On Nov 11, 2004, at 23:17, Terry Carroll wrote:
> a = [element for index, element in enumerate(a) if index not in b]
> 
>     Looks a bit more Pythonic to me. I also think it's faster.

Regular readers of this list know I can't resist an optimization 
challenge :-) Also I was suspicious of Max's claim, as this method is 
clearly O(len(a) * len(b)) at least - for each element of a, the list b 
is searched for the index.

Terry Carroll proposed:
>  b_rev = b
Note: if you want to copy b, use b_rev = b[:]
>  b_rev.sort()      # b = [1, 2, 4] ; sorted just to be sure
>  b_rev.reverse()   # b = [4, 2, 1]
>  for i in b_rev:
>    del a[i]

At first glance this seems more promising, as it skips the searching 
through b. But the del a[i] step requires copying all the elements of a 
after location i, so it is O(len(a)) and this method also seems to be 
O(len(a) * len(b)) - that is, the time taken will be proportional to the 
length of a times the length of b.

Liam Clarke proposed

> for i  in b:
>    a[i]="XX"    <-- this makes a=['a','XX','XX','d','XX','f','g']
> 
> while a.count("XX"):  While the number of occurences of 'XX' in a are above zero, do this...
>       a.remove("XX")   Remove the first occurrence of 'XX' found

This doesn't strike me as too promising because removing the 'XX' values 
requires _two_ searches through a for _each_ removal. But the removal 
could be done with a list comprehension which would make it work with a 
single pass over a:
a = [ element for element in a if element != 'XX' ]

Of course this requires that 'XX' does not appear in a, which could be a 
problem. A way around that is to make a unique sentinel and use it instead:
class Sentinel:
     pass
s = Sentinel()

This algorithm appears to be O(len(a) + len(b)) which could be a 
significant advantage for large a and b.

So, those are the three contenders. Here is a program to test them. It 
removes every fourth element from a list of 10000 integers:

##################

import random, timeit

a = range(10000)
b = range(0, 10000, 4)

random.shuffle(b)  # Make the sort do some work

def remove1(a, b):
     b = b[:]
     b.sort()
     b.reverse()
     for i in b:
         del a[i]

def remove2(a, b):
     a[:] = [ element for i, element in enumerate(a) if i not in b ]

def remove3(a, b):
     class Sentinel: pass
     s = Sentinel()

     for i in b:
         a[i] = s
     a[:] = [elem for elem in a if elem is not s]

def timeOne(fn):
     setup = "from __main__ import " + fn.__name__ + ',a,b'
     stmt = fn.__name__ + '(a[:], b[:])'

     t = timeit.Timer(stmt, setup)
     secs = min(t.repeat(3, 1))
     print '%s: %f secs' % (fn.__name__, secs)

# Make sure they all give the same answer
a1=a[:]
b1 = b[:]
remove1(a1, b1)

a2=a[:]
b2 = b[:]
remove2(a2, b2)

a3=a[:]
b3 = b[:]
remove3(a3, b3)

assert a1 == a2
assert a3 == a2

timeOne(remove1)
timeOne(remove2)
timeOne(remove3)

################

Here are my results:
remove1: 0.023205 secs
remove2: 1.262509 secs
remove3: 0.004119 secs

So with large a and b, the algorithm with the sentinel is a clear 
winner. The enumerate algorithm is far slower than the one with del 
a[i]; my guess this is because the del does a block move of memory, 
which can be very fast, while the enumerate version has to actually search.

What about for small a and b? Here are my results with
a = range(100)
b = range(0, 100, 4)

remove1: 0.000032 secs
remove2: 0.000181 secs
remove3: 0.000052 secs

Now the del version wins; I think the very fast built-in list operations 
take the prize here.

Of course with small a and b, these are all very fast and the enumerate 
version has the advantage of being concise and readable.

Kent