Most efficient solution?

Mon Jul 16 16:02:55 EDT 2001

Jay,

you wrote:
> 
> I have a simple problem, but one that, without some optimization, might hurt the performance of my project. To sum it up, I
> have two lists, we'll call them list A and list B, both lists contain only one-word strings (ie. both were generated by string.split()
> performed on a large amount of text)
> 
> List B consists of my "stopwords", meaning, the words I don't want included in my final version of list A. So what I need to
> do is check every item in list A, and if it occurs in list B, then I want to remove it from the final version of A. My first thought
> would be:
> 
> for eachItem in A:
>     if eachItem in B:
>         A.remove(eachItem)
> 
> Now, this will work fine, however, the size of list A varies, and while list B is constant, it is quite large (more than 750 items).
> List A can also, should the situation warrant, become quite huge as well. The main problem is that I have no idea of the
> efficiency of "in" for large lists like this. Can anyone think of a possibly quicker way, or is this the best?

As pointed out by many others you can do the test
    if eachItem in B:
in constant time by using a dictionary:

Bdict = {}
for w in B: Bdict[w] = 1

I have not seen the following variation posted so here it goes:

[word for word in A if not Bdict.has_key(word)]

In case you insist on using A.remove() it will become the
bottleneck: it takes time in the order of the length of A.
The easy way to avoid this is by limiting the size of A
eg. by reading (tokenizing) a stream and filtering using
the test with Bdict.

Good luck,
Ype

-----------------
email at xs4all.nl