Performance problem with filtering
Delaney, Timothy
tdelaney at avaya.com
Wed Mar 13 22:54:40 EST 2002
> From: Gerhard Häring [mailto:gh_pythonlist at gmx.de]
>
> I have two lists of files (approx. 50000 entries). Now I want
> to have all the
> entries of list b, that are not in list a. However, the primitive:
>
> results = []
> for entry in b:
> if entry not in a:
> results.append(entry)
>
> is terribly slow. I mean *really* slow. Any recommendations
> on how to optimize
> this? Wouldn't it be nice if I could simply do b.removeall(a)?
Firstly, you are doing 'entry in a' for each entry in b. This is a linear
search through a ... *very* slow.
So, the first thing I would recommend is ...
results = []
c = {}
for entry in a:
c[entry] = None
for entry in b:
if c.has_key(entry):
results.append(entry)
This means you are searching for a key in a dictionary ... a very quick
operation.
A further optimisation is to use the filter() function.
results = []
c = {}
for entry in a:
c[entry] = None
results = filter(c.has_key, b)
This should give you a couple of orders of magnitude speedup. I would
suggest trying both and comparing the performance.
Tim Delaney
More information about the Python-list
mailing list