[Tutor] partial string matching in list comprehension?
Kent Johnson
kent37 at tds.net
Fri May 26 03:00:16 CEST 2006
doug shawhan wrote:
> I have a series of lists to compare with a list of exclusionary terms.
>
> junkList =["interchange", "ifferen", "thru"]
>
> The comparison lists have one or more elements, which may or may not
> contain the junkList elements somewhere within:
>
> l = ["My skull hurts", "Drive the thruway", "Interchangability is not my forte"]
>
> ... output would be
>
> ["My skull hurts"]
>
> I have used list comprehension to match complete elements, how can I do
> a partial match?
One way is to use a helper function to do the test:
In [1]: junkList = ["interchange", "ifferen", "thru"]
In [2]: lst = ["My skull hurts", "Drive the thruway", "Interchangability is not my forte"]
In [3]: def hasJunk(s):
   ...:     for junk in junkList:
   ...:         if junk in s:
   ...:             return True
   ...:     return False
   ...:
In [4]: [ s for s in lst if not hasJunk(s) ]
Out[4]: ['My skull hurts', 'Interchangability is not my forte']
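Hmm, I guess spelling counts :-) "Interchangability" is misspelled, so it
does not contain "interchange" and slips through.
Also, you might want to make this case-insensitive by testing s.lower() in
hasJunk() (the junkList entries are already lowercase).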
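Here is a minimal sketch of that case-insensitive variant. The name hasJunkNoCase and the extra "INTERCHANGE ahead" test string are made up for illustration; it also assumes every entry in junkList is already lowercase.

```python
junkList = ["interchange", "ifferen", "thru"]
# The last string is an invented example with upper-case junk in it.
lst = ["My skull hurts", "Drive the thruway",
       "Interchangability is not my forte", "INTERCHANGE ahead"]

def hasJunkNoCase(s):
    # Hypothetical helper: lower-case the string once, then test each term.
    # Assumes the junkList entries themselves are lowercase.
    lowered = s.lower()
    return any(junk in lowered for junk in junkList)

kept = [s for s in lst if not hasJunkNoCase(s)]
print(kept)
```

The misspelled "Interchangability" string should still survive, since lowercasing does not fix the spelling.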
Another way is to make a regular expression that matches all the junk:
In [7]: import re
Escape the junk in case it has any re-special chars:
In [9]: allJunk = '|'.join(re.escape(junk) for junk in junkList)
In [10]: allJunk
Out[10]: 'interchange|ifferen|thru'
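For this junkList the escaping changes nothing, because none of the terms contain special characters. A small sketch of where it would matter, using an invented term "a.b" that is not in the original list:

```python
import re

# Hypothetical junk term containing a regex metacharacter (the dot).
term = "a.b"

rawRe = re.compile(term)                 # '.' matches any character
escapedRe = re.compile(re.escape(term))  # '.' matches only a literal dot

print(bool(rawRe.search("axb")))      # the unescaped pattern matches "axb"
print(bool(escapedRe.search("axb")))  # the escaped pattern does not
print(bool(escapedRe.search("a.b")))  # it still matches the literal term
```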
You could compile with re.IGNORECASE to make case-insensitive matches.
Spelling still counts though ;)
In [11]: junkRe = re.compile(allJunk)
In [13]: [ s for s in lst if not junkRe.search(s) ]
Out[13]: ['My skull hurts', 'Interchangability is not my forte']
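A sketch of the re.IGNORECASE variant mentioned above. The name junkReNoCase and the "Take the THRUWAY" string are assumptions added for illustration.

```python
import re

junkList = ["interchange", "ifferen", "thru"]
# The last string is an invented example with upper-case junk in it.
lst = ["My skull hurts", "Drive the thruway",
       "Interchangability is not my forte", "Take the THRUWAY"]

# Escape each term, join with '|', and compile once, ignoring case.
allJunk = '|'.join(re.escape(junk) for junk in junkList)
junkReNoCase = re.compile(allJunk, re.IGNORECASE)

kept = [s for s in lst if not junkReNoCase.search(s)]
print(kept)
```

As with the helper function, the misspelled string is kept because no junk term actually occurs in it.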
My guess is the re version will be faster, at least if you don't count
the compile, but only testing will tell for sure:
In [14]: import timeit
In [18]: timeit.Timer(setup='from __main__ import hasJunk,lst', stmt='[ s for s in lst if not hasJunk(s) ]').timeit()
Out[18]: 11.921303685244915
In [19]: timeit.Timer(setup='from __main__ import junkRe,lst', stmt='[ s for s in lst if not junkRe.search(s) ]').timeit()
Out[19]: 8.3083201915327223
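So for this data using re is a little faster: about 8.3 seconds versus
11.9 seconds for timeit's default of one million runs, roughly 30% less
time. Test with real data to be sure!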
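To rerun the comparison as a standalone script, something like the following should work on Python 3.5 or later, where timeit accepts a globals argument. The run count of 100000 and the names namespace, helperTime and reTime are assumptions, and the timings will differ by machine and data, so no expected numbers are given.

```python
import re
import timeit

junkList = ["interchange", "ifferen", "thru"]
lst = ["My skull hurts", "Drive the thruway",
       "Interchangability is not my forte"]

def hasJunk(s):
    for junk in junkList:
        if junk in s:
            return True
    return False

junkRe = re.compile('|'.join(re.escape(junk) for junk in junkList))

# Names the timed statements are allowed to see.
namespace = {"lst": lst, "hasJunk": hasJunk, "junkRe": junkRe}
runs = 100000  # assumed run count, smaller than timeit's default of 1000000

helperTime = timeit.timeit('[s for s in lst if not hasJunk(s)]',
                           globals=namespace, number=runs)
reTime = timeit.timeit('[s for s in lst if not junkRe.search(s)]',
                       globals=namespace, number=runs)
print("helper: %.3f s, re: %.3f s" % (helperTime, reTime))
```

Swap in your own lst and junkList before drawing conclusions; with many junk terms or long strings the ranking may change.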
Kent