Most efficient solution?
Ype Kingma
ykingma at accessforall.nl
Mon Jul 16 16:02:55 EDT 2001
Jay,
you wrote:
>
> I have a simple problem, but one that, without some optimization, might hurt the performance of my project. To sum it up, I
> have two lists, we'll call them list A and list B, both lists contain only one-word strings (ie. both were generated by string.split()
> performed on a large amount of text)
>
> List B consists of my "stopwords", meaning, the words I don't want included in my final version of list A. So what I need to
> do is check every item in list A, and if it occurs in list B, then I want to remove it from the final version of A. My first thought
> would be:
>
> for eachItem in A:
> if eachItem in B:
> A.remove(eachItem)
>
> Now, this will work fine, however, the size of list A varies, and while list B is constant, it is quite large (more than 750 items).
> List A can also, should the situation warrant, become quite huge as well. The main problem is that I have no idea of the
> efficiency of "in" for large lists like this. Can anyone think of a possibly quicker way, or is this the best?
As pointed out by many others you can do the test
if eachItem in B:
in constant time by using a dictionary:
Bdict = {}
for w in B: Bdict[w] = 1
I have not seen the following variation posted so here it goes:
[word for word in A if not Bdict.has_key(word)]
In case you insist on using A.remove() it will become the
bottleneck: it takes time in the order of the length of A.
The easy way to avoid this is by limiting the size of A
eg. by reading (tokenizing) a stream and filtering using
the test with Bdict.
Good luck,
Ype
-----------------
email at xs4all.nl
More information about the Python-list
mailing list