[Tutor] Advanced String Search using operators AND, OR etc..
C or L Smith
smiles at worksmail.net
Tue May 5 08:17:38 CEST 2009
>> From: Alex Feddor
>>
>> I am looking for method enables advanced text string search. Method
>> string.find() or re module seems no supporting what I am looking
>> for. The idea is as follows:
>>
>> Text ="FDA meeting was successful. New drug is approved for whole
>> sale distribution!"
>>
>>
>> I would really appreciated your advice - code sample / links how
>> above can be achieved! if possible I would appreciated solution
>> achieved with free of charge module.
The pieces to assemble a solution are not too hard to master. Instead of thinking of searching your text, think about searching a list of words in the text for what you are interested in.
The re pattern to match a word containing only letters is [a-zA-Z]+. This pattern can cut your text into words for you. A list of words corresponding to your text can then be made with re.findall():
###
>>> import re
>>> word=re.compile('[a-zA-Z]+')
>>> text = """FDA meeting was successful."""
>>> Words = re.findall(word, text)
>>> Words
['FDA', 'meeting', 'was', 'successful']
>>>
###
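If you want to reuse that step, it can be wrapped in a small function. This is only a sketch: the name split_words is made up for illustration, and it keeps the same letters-only pattern, so digits and apostrophes act as word breaks ("don't" comes back as 'don' and 't').

```python
import re

# Same pattern as above: a run of ASCII letters counts as one word.
WORD = re.compile(r'[a-zA-Z]+')

def split_words(text):
    """Return the words of text, in order (split_words is an illustrative name)."""
    return WORD.findall(text)

print(split_words("FDA meeting was successful. New drug is approved!"))
```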
Some modules that were written for one purpose contain gems that are handy for another. For your purposes, the fnmatch module (written for matching filenames) has a string matching function that is lightweight compared to re and can tell you whether a word matches a given criterion or not. There are only four types of patterns to master:
* matches anything
? matches a single character
[seq] matches any character in the sequence
[!seq] matches any character NOT in the sequence
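A quick way to get a feel for the four patterns is to try each one against a single word. The sample words below are just picked from the example text; note that a pattern has to match the whole word, not just part of it.

```python
import fnmatch

# fnmatchcase() compares case sensitively on every platform.
print(fnmatch.fnmatchcase('approved', 'approve*'))  # *      : any run of characters
print(fnmatch.fnmatchcase('drug', 'dr?g'))          # ?      : exactly one character
print(fnmatch.fnmatchcase('sale', '[sw]ale'))       # [seq]  : one of s or w
print(fnmatch.fnmatchcase('sale', '[!sw]ale'))      # [!seq] : anything except s or w
print(fnmatch.fnmatchcase('approved', 'approve'))   # no wildcard: whole word must match
```

The first three print True; the last two print False.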
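The module has two matchers: fnmatchcase() always compares case sensitively, while fnmatch() first normalizes case the way the operating system does for filenames, so it ignores case on Windows but not on Linux or other POSIX systems. To get a case-insensitive test that behaves the same everywhere, the helper below lowercases both the word and the pattern itself (and it is case sensitive by default):
###
import fnmatch
def match(pat, words, case=True):
    """See if pat matches any word in words. A generator expression
    inside any() lets the search stop at the first match instead of
    building the whole list."""
    if case:
        return any(fnmatch.fnmatchcase(x, pat) for x in words)
    # fnmatch.fnmatch() only ignores case on case-insensitive platforms,
    # so lowercase both sides and compare exactly.
    pat = pat.lower()
    return any(fnmatch.fnmatchcase(x.lower(), pat) for x in words)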
###
Now you can see if a certain pattern is in your list of words or not:
###
>>> Words=['FDA', 'meeting', 'was', 'successful']
>>> match('FDA',Words)
True
>>> match('fda',Words)
False
>>> match('fda',Words, case=False)
True
>>>
###
And now string together whatever tests you like for a given line:
###
>>> match('FDA',Words) and (match('approve*',Words) or match('success*',Words))
True
>>>
###
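If you end up writing the same combination more than once, it may be worth naming it. The sketch below restates a case-sensitive matcher so that it runs on its own; has, all_of, any_of and fda_news are hypothetical names chosen for illustration, not part of fnmatch.

```python
import fnmatch
import re

def has(pat, words):
    """True if any word matches the fnmatch-style pattern (case sensitive)."""
    return any(fnmatch.fnmatchcase(w, pat) for w in words)

def all_of(pats, words):
    """AND: every pattern must match some word."""
    return all(has(p, words) for p in pats)

def any_of(pats, words):
    """OR: at least one pattern must match some word."""
    return any(has(p, words) for p in pats)

def fda_news(text):
    """Hypothetical query: 'FDA' AND ('approve*' OR 'success*')."""
    words = re.findall(r'[a-zA-Z]+', text)
    return all_of(['FDA'], words) and any_of(['approve*', 'success*'], words)

print(fda_news("FDA meeting was successful."))   # True
print(fda_news("The meeting was successful."))   # False, no 'FDA'
```

NOT is just Python's own not in front of any of these calls.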
If you are searching a large piece of text you might want to turn the list of words into a set of unique words so there is less to search. The match function works equally well with a set.
###
>>> text='this is a list is a list is a list'
>>> re.findall(word,text)
['this', 'is', 'a', 'list', 'is', 'a', 'list', 'is', 'a', 'list']
>>> set(_)
set(['this', 'a', 'is', 'list'])
>>> match('is', _)
True
>>>
###
You also might want to apply your search line by line, but those are details you might already know how to handle.
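In case it is useful, here is one way the line-by-line version could look. It is a sketch under a few assumptions: the names has, matching_lines and query are invented for this example, the sample text is made up, and the print call assumes Python 3.

```python
import fnmatch
import re

def has(pat, words):
    """True if any word matches the fnmatch-style pattern (case sensitive)."""
    return any(fnmatch.fnmatchcase(w, pat) for w in words)

def matching_lines(text, test):
    """Yield (line number, line) for each line whose words pass test."""
    for number, line in enumerate(text.splitlines(), 1):
        # A set of unique words keeps each search short.
        words = set(re.findall(r'[a-zA-Z]+', line))
        if test(words):
            yield number, line

def query(words):
    """Hypothetical query: 'FDA' AND ('approve*' OR 'success*')."""
    return has('FDA', words) and (has('approve*', words) or has('success*', words))

sample = """FDA meeting was successful.
New drug is approved for wholesale distribution!
FDA has approved the new drug."""

for number, line in matching_lines(sample, query):
    print(number, line)
```

With this sample, lines 1 and 3 are reported; line 2 is skipped because it never mentions FDA.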
Hope that helps!
/chris