Most efficient solution?

Steve Holden sholden at holdenweb.com
Mon Jul 16 16:37:54 CEST 2001


"Jay Parlar" <jparlar at home.com> wrote in ...
> I have a simple problem, but one that, without some optimization, might
hurt the performance of my project. To sum it up, I
> have two lists, we'll call them list A and list B, both lists contain only
one-word strings (ie. both were generated by string.split()
> performed on a large amount of text)
>
> List B consists of my "stopwords", meaning, the words I don't want
included in my final version of list A. So what I need to
> do is check every item in list A, and if it occurs in list B, then I want
to remove it from the final version of A. My first thought
> would be:
>
> for eachItem in A:
>     if eachItem in B:
>         A.remove(eachItem)
>
> Now, this will work fine, however, the size of list A varies, and while
list B is constant, it is quite large (more than 750 items).
> List A can also, should the situation warrant, become quite huge as well.
The main problem is that I have no idea of the
> efficiency of "in" for large lists like this. Can anyone think of a
possibly quicker way, or is this the best?
> One other note: List A may (and usually does) contain duplicate words. If
those duplicate words appear in list B, then I want
> them both removed, but if they don't appear in list B, then I want them to
remain separate (ie. if the word "Python" shows up
> five times in A, then I want the final version of A to still contain five
occurrences of "Python")
>
> I really don't know if this can be made any quicker, but any insight would
be appreciated. For the cases I've been running,
> it's been quick enough so far, but there's a good chance that the amount
of data in list A will be getting much larger, and I'll
> have to perform this entire operation (for different versions of list A)
multiple times in one program execution.

You will probably find that for larger lists you'd be better of using a
dictionary, so you can test for a given word using has_key. However, you
also need to consider whether it is important to retain the non-noise words
in list A in their original order, as dictionaries don't have an special
ordering on their keys.

A hybrid solution (untested) would be:

    dict = {}
    for w in B:
        dict[w] = 1
    for eachitem in A:
        if dict.has_key(eachitem):
            A.remove(eachitem)

You don't say whether you want to retain repetitions in list A, or ordering.
Depending on your exact requirements there will be other solutions which are
more efficient, but the key (no pun intended) feature here is that it's
faster to look strings up as dictionary leys than it is to find them in a
long list.

regards
 Steve
--
http://www.holdenweb.com/








More information about the Python-list mailing list