[Tutor] Advanced String Search using operators AND, OR etc..

C or L Smith smiles at worksmail.net
Tue May 5 08:17:38 CEST 2009


>> From: Alex Feddor 
>> 
>> I am looking for a method that enables advanced text string search. The
>> string.find() method and the re module do not seem to support what I am
>> looking for. The idea is as follows:
>> 
>> Text ="FDA meeting was successful. New drug is approved for whole
>> sale distribution!"
>> 
>> 
>> I would really appreciate your advice (code samples or links) on how the
>> above can be achieved! If possible, I would appreciate a solution that
>> uses a free-of-charge module.

The pieces to assemble a solution are not too hard to master. Instead of thinking of searching your text, think about searching a list of words in the text for what you are interested in.

The re pattern to match a word containing only letters is [a-zA-Z]+. This pattern can cut your text into words for you. A list of words corresponding to your text can then be made with re.findall():

###
>>> import re
>>> word = re.compile('[a-zA-Z]+')
>>> text = """FDA meeting was successful."""
>>> Words = re.findall(word, text)
>>> Words
['FDA', 'meeting', 'was', 'successful']
>>> 
###
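One thing to keep in mind about that pattern is that it only knows about letters, so digits disappear and contractions are split. The short sketch below shows this on a made-up sentence; the wider pattern in it is just an assumption about what you might want to keep, not something your text necessarily requires.

```python
import re

# The letters-only pattern from above.
word = re.compile('[a-zA-Z]+')

# Hypothetical sample sentence, chosen to contain a digit and an apostrophe.
sample = "Phase 3 trial wasn't delayed."

# Digits are dropped and "wasn't" is split at the apostrophe.
tokens = word.findall(sample)
print(tokens)   # ['Phase', 'trial', 'wasn', 't', 'delayed']

# An assumed alternative that also keeps digits and apostrophes.
wider = re.compile("[a-zA-Z0-9']+")
wider_tokens = wider.findall(sample)
print(wider_tokens)   # ['Phase', '3', 'trial', "wasn't", 'delayed']
```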

Some modules that are intended for one purpose hide gems that are handy for another. For your purposes, the fnmatch module (written for matching file names) has a string matching function that is lightweight compared to re and can tell you whether a word matches a given pattern or not. The pattern must match the whole word, and there are only 4 types of patterns to master:

* matches anything
? matches a single character
[seq] matches any character in the sequence
[!seq] matches any character NOT in the sequence
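To make the four pattern types concrete, here is a small sketch that tries each one with fnmatch.fnmatchcase(), which takes the word first and the pattern second. The words and patterns are only examples I picked from your text.

```python
import fnmatch

# * matches anything, here the trailing 'd'
star = fnmatch.fnmatchcase('approved', 'approve*')

# ? matches exactly one character
question = fnmatch.fnmatchcase('drug', 'dr?g')

# [seq] matches one character from the sequence: 's' or 't'
in_seq = fnmatch.fnmatchcase('sale', '[st]ale')

# [!seq] matches one character NOT in the sequence
not_in_seq = fnmatch.fnmatchcase('sale', '[!st]ale')

# Without a wildcard the whole word has to match the pattern
whole_word = fnmatch.fnmatchcase('approved', 'approve')

print(star, question, in_seq, not_in_seq, whole_word)
# True True True False False
```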

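Within the module, fnmatchcase() always compares case sensitively, while fnmatch() first normalizes case according to the operating system's file name rules, so it is case insensitive on Windows but not on POSIX systems. We can write a helper function that offers both behaviors on any platform by lowercasing the word and the pattern for the case-insensitive branch (it is set right now to be case sensitive by default):

###
import fnmatch
def match(pat, words, case=True):
    """See if pat matches a word in words. A generator is passed to
    any() so the search stops at the first matching word."""
    if case:
        return any(fnmatch.fnmatchcase(x, pat) for x in words)
    else:
        # fnmatch.fnmatch() only ignores case on some platforms, so
        # lowercase both sides for a portable comparison.
        return any(fnmatch.fnmatchcase(x.lower(), pat.lower()) for x in words)
###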

Now you can see if a certain pattern is in your list of words or not:

###
>>> Words=['FDA', 'meeting', 'was', 'successful']
>>> match('FDA',Words)
True
>>> match('fda',Words)
False
>>> match('fda',Words, case=False)
True
>>> 
###

And now string together whatever tests you like for a given line:

###
>>> match('FDA',Words) and (match('approve*',Words) or match('success*',Words))
True
>>> 
###
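If you end up writing many such tests, the and/or logic can be wrapped in small helpers. The following is only a sketch: all_of() and any_of() are names I made up for illustration, and match() is repeated here in its case-sensitive form so the example runs on its own.

```python
import fnmatch
import re

def match(pat, words):
    """True if pat matches at least one word (case sensitive)."""
    return any(fnmatch.fnmatchcase(w, pat) for w in words)

def all_of(pats, words):
    """Hypothetical helper: AND, every pattern must match some word."""
    return all(match(p, words) for p in pats)

def any_of(pats, words):
    """Hypothetical helper: OR, at least one pattern must match some word."""
    return any(match(p, words) for p in pats)

text = ("FDA meeting was successful. New drug is approved for whole "
        "sale distribution!")
words = set(re.findall('[a-zA-Z]+', text))

# FDA AND (approve* OR success*)
hit = match('FDA', words) and any_of(['approve*', 'success*'], words)

# FDA AND reject*
miss = all_of(['FDA', 'reject*'], words)

# FDA AND NOT fail*
no_failure = match('FDA', words) and not match('fail*', words)

print(hit, miss, no_failure)   # True False True
```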

If you are searching a large piece of text, you might want to turn the list of words into a set of unique words so there is less to search. The match function works equally well with a set.

###
>>> text='this is a list is a list is a list'
>>> re.findall(word,text)
['this', 'is', 'a', 'list', 'is', 'a', 'list', 'is', 'a', 'list']
>>> set(_)
set(['this', 'a', 'is', 'list'])
>>> match('is', _)
True
>>> 
###

You also might want to apply your search line by line, but those are details you might already know how to handle. 
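In case it is useful, here is one way the line-by-line search could look. It is a minimal sketch: the three-line text is invented for the example, and match() is repeated in its case-sensitive form so the block is self-contained.

```python
import fnmatch
import re

WORD = re.compile('[a-zA-Z]+')

def match(pat, words):
    """True if pat matches at least one word (case sensitive)."""
    return any(fnmatch.fnmatchcase(w, pat) for w in words)

# Invented sample text; in practice this would come from your own source.
text = """FDA meeting was successful.
New drug is approved for whole sale distribution!
The FDA approved the new label."""

hits = []
for number, line in enumerate(text.splitlines(), 1):
    # Build the set of unique words for this line only.
    words = set(WORD.findall(line))
    # FDA AND (approve* OR success*)
    if match('FDA', words) and (match('approve*', words) or
                                match('success*', words)):
        hits.append(number)

print(hits)   # [1, 3]
```

Line 2 is skipped because it mentions approval but not the FDA, which is what the AND in the test asks for.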

Hope that helps!

/chris

