Most efficient solution?
jparlar at home.com
Mon Jul 16 09:19:09 EDT 2001
I have a simple problem, but one that, without some optimization, might hurt the performance of my project. To sum it up, I
have two lists, which we'll call list A and list B. Both lists contain only one-word strings (i.e. both were generated by
string.split() performed on a large amount of text).
List B consists of my "stopwords", meaning, the words I don't want included in my final version of list A. So what I need to
do is check every item in list A, and if it occurs in list B, then I want to remove it from the final version of A. My first thought
was:

    for eachItem in A:
        if eachItem in B:
            continue  # stopword: leave it out of the final version of A

Now, this will work fine. However, the size of list A varies, and while list B is constant, it is quite large (more than 750 items).
List A can also, should the situation warrant, become quite large. The main problem is that I have no idea of the
efficiency of "in" for large lists like this. Can anyone think of a possibly quicker way, or is this the best?
One other note: List A may (and usually does) contain duplicate words. If those duplicate words appear in list B, then I want
all of them removed, but if they don't appear in list B, then I want them to remain separate (i.e. if the word "Python" shows up
five times in A, then I want the final version of A to still contain five occurrences of "Python").
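
For reference, the requirements above can be sketched as follows. This is only one possible approach, not a measured answer: it assumes a Python version with the built-in set type (a dictionary keyed on the stopwords would serve the same purpose), and the name remove_stopwords and the sample words are made up for illustration. "in" on a list scans the items one by one, while "in" on a set is a hash lookup, so the cost no longer grows with the size of list B.

```python
def remove_stopwords(words, stopwords):
    """Return a new list holding every word that is not in stopwords.

    Order and duplicates of the surviving words are preserved.
    """
    return [word for word in words if word not in stopwords]


# Build the lookup set once; it can be reused for every version of list A.
# (Sample data only, standing in for the real stopword list.)
B = set("the a of and".split())

A = "the Python of Python and a list".split()
filtered = remove_stopwords(A, B)
print(filtered)
```

Building a new list also avoids modifying A while looping over it, which can skip items.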
I really don't know if this can be made any quicker, but any insight would be appreciated. For the cases I've been running,
it's been quick enough so far, but there's a good chance that the amount of data in list A will be getting much larger, and I'll
have to perform this entire operation (for different versions of list A) multiple times in one program execution.
Software Engineering III
Hamilton, Ontario, Canada
"Though there are many paths
At the foot of the mountain
All those who reach the top
See the same moon."