Performance problem with filtering

Wed Mar 13 22:54:40 EST 2002

> From: Gerhard Häring [mailto:gh_pythonlist at gmx.de]
> 
> I have two lists of files (approx. 50000 entries). Now I want 
> to have all the
> entries of list b, that are not in list a. However, the primitive:
> 
> results = []
> for entry in b:
>     if entry not in a:
>         results.append(entry)
> 
> is terribly slow. I mean *really* slow. Any recommendations 
> on how to optimize
> this? Wouldn't it be nice if I could simply do b.removeall(a)?

Firstly, you are doing 'entry in a' for each entry in b. This is a linear
search through a ... *very* slow.

So, the first thing I would recommend is ...

results = []
c = {}

for entry in a:
    c[entry] = None

for entry in b:
    if c.has_key(entry):
        results.append(entry)

This means you are searching for a key in a dictionary ... a very quick
operation.

A further optimisation is to use the filter() function.

results = []
c = {}

for entry in a:
    c[entry] = None

results = filter(c.has_key, b)

This should give you a couple of orders of magnitude speedup. I would
suggest trying both and comparing the performance.

Tim Delaney