[Tutor] partial string matching in list comprehension?

Fri May 26 03:00:16 CEST 2006

doug shawhan wrote:
> I have a series of lists to compare with a list of exclusionary terms.
> 
> junkList =["interchange",  "ifferen", "thru"]
> 
> The comparison lists have one or more elements, which may or may not 
> contain the junkList elements somewhere within:
> 
> l = ["My skull hurts", "Drive the thruway", "Interchangability is not my 
> forte"]
> 
> ... output would be
> 
> ["My skull hurts"]
> 
> I have used list comprehension to match complete elements, how can I do 
> a partial match?

One way is to use a helper function to do the test:

In [1]: junkList =["interchange",  "ifferen", "thru"]

In [2]: lst = ["My skull hurts", "Drive the thruway", "Interchangability is not my forte"]

In [3]: def hasJunk(s):
     ...:     for junk in junkList:
     ...:         if junk in s:
     ...:             return True
     ...:     return False
     ...:

In [4]: [ s for s in lst if not hasJunk(s) ]
Out[4]: ['My skull hurts', 'Interchangability is not my forte']

Hmm, I guess spelling counts :-) The misspelled "Interchangability" does
not contain "interchange", so it slips through.
Also, you might want to make this case-insensitive by taking s.lower() in
hasJunk().
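
A rough sketch of that case-insensitive variant, assuming the terms in
junkList are already lowercase. The name hasJunkNoCase is made up for
illustration, and any() just replaces the explicit loop:

```python
junkList = ["interchange", "ifferen", "thru"]
lst = ["My skull hurts", "Drive the thruway",
       "Interchangability is not my forte"]

def hasJunkNoCase(s):
    # Hypothetical variant of hasJunk(): lowercase the string once,
    # then test each junk term against it.
    # Assumes every term in junkList is already lowercase.
    lowered = s.lower()
    return any(junk in lowered for junk in junkList)

kept = [s for s in lst if not hasJunkNoCase(s)]
```

The misspelled string still gets through, since lowercasing does nothing
for spelling, but something like "THRU traffic only" would now be caught.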

Another way is to make a regular expression that matches all the junk:

In [7]: import re

Escape the junk in case it has any re-special chars:
In [9]: allJunk = '|'.join(re.escape(junk) for junk in junkList)

In [10]: allJunk
Out[10]: 'interchange|ifferen|thru'

You could compile with re.IGNORECASE to make case-insensitive matches.
Spelling still counts though ;)

In [11]: junkRe = re.compile(allJunk)

In [13]: [ s for s in lst if not junkRe.search(s) ]
Out[13]: ['My skull hurts', 'Interchangability is not my forte']
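
For the case-insensitive flavor of the re approach, something along these
lines should work. This is only a sketch; junkReNoCase is a name invented
here, and the only change from the session above is passing re.IGNORECASE
to re.compile():

```python
import re

junkList = ["interchange", "ifferen", "thru"]
lst = ["My skull hurts", "Drive the thruway",
       "Interchangability is not my forte"]

# Escape each term, join them into one alternation, and compile it
# with re.IGNORECASE so "THRU" and "Thru" match as well as "thru".
junkReNoCase = re.compile(
    '|'.join(re.escape(junk) for junk in junkList),
    re.IGNORECASE,
)

kept = [s for s in lst if not junkReNoCase.search(s)]
```

With this pattern a string such as "Something DIFFERENT" would be
filtered out, which the case-sensitive junkRe would let through.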

My guess is the re version will be faster, at least if you don't count
the compile, but only testing will tell for sure:

In [14]: import timeit

In [18]: timeit.Timer(setup='from __main__ import hasJunk,lst', stmt='[ s for s in lst if not hasJunk(s) ]').timeit()
Out[18]: 11.921303685244915

In [19]: timeit.Timer(setup='from __main__ import junkRe,lst', stmt='[ s for s in lst if not junkRe.search(s) ]').timeit()
Out[19]: 8.3083201915327223

So for this data using re is a little faster: about 8.3 seconds versus
11.9 for timeit's default of one million runs. Test with real data to be
sure!
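
If you want to rerun the comparison as a script, here is one way it might
look. It is a sketch, not a careful benchmark: lst is a stand-in for your
real data, the names filterWithHelper and filterWithRe and the run count
are arbitrary choices made here, and the timings will differ from machine
to machine:

```python
import re
import timeit

junkList = ["interchange", "ifferen", "thru"]
# Stand-in data; replace with the real lists you need to filter.
lst = ["My skull hurts", "Drive the thruway",
       "Interchangability is not my forte"]

def hasJunk(s):
    # Same helper as in the session above.
    for junk in junkList:
        if junk in s:
            return True
    return False

junkRe = re.compile('|'.join(re.escape(junk) for junk in junkList))

def filterWithHelper():
    return [s for s in lst if not hasJunk(s)]

def filterWithRe():
    return [s for s in lst if not junkRe.search(s)]

runs = 20000  # arbitrary; raise it for steadier numbers
# Take the best of three repeats to reduce noise from other processes.
helperTime = min(timeit.repeat(filterWithHelper, number=runs, repeat=3))
reTime = min(timeit.repeat(filterWithRe, number=runs, repeat=3))

print("helper: %.3f s, re: %.3f s" % (helperTime, reTime))
```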

Kent