Most efficient solution?

Gareth McCaughan Gareth.McCaughan at pobox.com
Mon Jul 16 18:19:07 EDT 2001


Jay Parlar wrote:
[...]
[He has two lists of words, a long one (A) and a shorter one (B).
He wants to get the result of removing every instance of each word
in list B from list A.]

Many people have suggested a dictionary-based solution. Here's
another approach, which turns out to be slower (for me, anyway).

Build a regular expression which matches a word exactly when
it isn't in list B. It should look like "(?!....$)"
where "...." matches exactly the words in list B.
(This is a "negative lookahead assertion". Actually it's not
quite right to say that it "matches a word"; it matches a
zero-length substring when it's followed by something that
matches "...." and the end of the string!)

In building "...." you need to take care to make use of common
initial substrings of the words in list B. It's probably
also a good idea to exploit character classes.

I tried this with list A containing all the words (including
repetitions) in an archive of about 800 articles from another
newsgroup, and list B being the 750 commonest words from list A.
List A had about 310000 words in it. (Some of these "words"
were fragments of message-IDs, etc, of course.)

The dictionary-based approach took about 0.96 seconds. The
regexp-based approach took about 1.00 seconds. (Plus the time
to build and compile the regular expression, but that's fixed
however large list A gets.)

This suggests that there might be applications in which
the regexp approach is better, though I wouldn't bet on it.
It also suggests that the lists will have to be quite long
before you care much about the speed. (Admittedly my machine
is quite fast.)

There's something like a factor of 2 improvement in speed
in going from string.join(words, '|') to something more
sophisticated; and (a little surprisingly, to me) another
factor of about 2 in going from

    filter(lambda x: not r.match(x), words)

to

    filter(r.match, words)

with r containing a negative lookahead assertion instead of
a regexp designed to match the things in list B.

-- 
Gareth McCaughan  Gareth.McCaughan at pobox.com
.sig under construc



More information about the Python-list mailing list