[Tutor] Advanced String Search using operators AND, OR etc..

Lie Ryan lie.1296 at gmail.com
Tue May 5 14:11:22 CEST 2009


Alex Feddor wrote:
> Hi
> 
> I am looking for a method that enables advanced text string searches.
> Neither string.find() nor the re module seems to support what I am
> looking for. The idea is as follows:
> 
> Text ="FDA meeting was successful. New drug is approved for whole sale 
> distribution!" 
> 
> I would like to scan the text using AND and OR operators and get -1 or
> another value if the search terms are not found in the text.
> 
> Example 01:
> search criteria:  "FDA" AND ( "approve*" OR "supported")
> The catch is that in the Text variable the words FDA and approve are not
> adjacent (other words come in between).

Bring on your hardest searches...

class Pattern(object): pass

class Logical(Pattern):
     def __init__(self, pat1, pat2):
         self.pat1 = pat1
         self.pat2 = pat2
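     def __call__(self, text):
         # Each sub-pattern returns a match index, or len(text) if not found.
         a, b = self.pat1(text), self.pat2(text)
         if self.op(a != len(text), b != len(text)):
             return min((a, b))  # earliest index among the sub-patterns
         return len(text)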
     def __str__(self):
         return '(%s %s %s)' % (self.pat1, self.op_name, self.pat2)

class P(Pattern):
     def __init__(self, pat):
         self.pat = pat
     def __call__(self, text):
         ret = text.find(self.pat)
         return ret if ret != -1 else len(text)
     def __str__(self):
         return '"%s"' % self.pat

class NOT(Pattern):
     def __init__(self, pat):
         self.op_name = 'NOT'
         self.pat = pat
     def __call__(self, text):
         ret = self.pat(text)
         return ret - 1 if ret == len(text) else len(text)
     def __str__(self):
         return '%s (%s)' % (self.op_name, self.pat)

class XOR(Logical):
     def __init__(self, pat1, pat2):
         self.op_name = 'XOR'
         self.op = lambda a, b: not(a and b) and (a or b)
         super().__init__(pat1, pat2)

class OR(Logical):
     def __init__(self, pat1, pat2):
         self.op_name = 'OR'
         self.op = lambda a, b: a or b
         super().__init__(pat1, pat2)

class AND(Logical):
     def __init__(self, pat1, pat2):
         self.op_name = 'AND'
         self.op = lambda a, b: a and b
         super().__init__(pat1, pat2)

class Suite(object):
     def __init__(self, pat):
         self.pat = pat
     def __call__(self, text):
         ret = self.pat(text)
         return ret if ret != len(text) else -1
     def __str__(self):
         return '[%s]' % self.pat

pat1 = P('FDA')
pat2 = P('approve')  # str.find() has no wildcards; 'approve*' would need a literal '*'
pat3 = P('supported')
p = Suite(AND(pat1, OR(pat2, pat3)))
print(p(''))
print(p('FDA'))
print(p('FDA supported'))
print(p('supported FDA'))
print(p('blah FDA bloh supported blih'))
print(p('blah FDA bleh supported bloh supported blih '))
p = Suite(AND(OR(pat1, pat2), XOR(pat2, NOT(pat3))))
print(p)
print(p(''))
print(p('FDA'))
print(p('FDA supported'))
print(p('supported sdc FDA sd'))
print(p('blah blih FDA bluh'))
print(p('blah blif supported blog'))

#################

I guess I went a bit overboard here (had too much time on my hands). It
works by function composition: instead of evaluating the expression
directly, you compose a function (or, more accurately, a callable object)
that evaluates the logical expression and returns the index of the first
item that matches it. It currently uses str's builtin find(), but I guess
it wouldn't be very hard to adapt it to use the re-based myfind() below
(only the P class would need to change).
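
As a rough sketch of that adaptation (not part of the code above): the
class name RP and its flags argument are made up for illustration. It
keeps the same convention as P, returning len(text) when nothing matches,
so it should slot into AND/OR/XOR unchanged.

```python
import re

class Pattern(object): pass

class RP(Pattern):
    """Hypothetical regex-based counterpart of P (name is an assumption)."""
    def __init__(self, pat, flags=0):
        self.pat = pat
        self.regex = re.compile(pat, flags)
    def __call__(self, text):
        m = self.regex.search(text)
        # Same convention as P: len(text) means "not found".
        return m.start() if m else len(text)
    def __str__(self):
        return 'r"%s"' % self.pat

text = 'FDA meeting was successful. New drug is approved.'
approve = RP(r'\bapprove\w*')        # rough regex equivalent of "approve*"
print(approve(text))                 # index where "approved" starts
print(RP(r'\bBen\b')('Benquick'))    # len('Benquick'), i.e. not found
```

With this, the "approve*" wildcard from Example 01 can be written as a
regular expression instead of a literal substring.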

The Suite class is only there to turn the NotFound sentinel from 
len(text) to -1 (used len(text) since it simplifies the code a lot...)
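
A small illustration of why the len(text) sentinel is convenient (the
helper find_or_len is a made-up name that mirrors what P.__call__ does):
a miss sorts after every real index, so min() picks the earliest real
match without any special-casing.

```python
def find_or_len(text, sub):
    """Hypothetical helper: like str.find(), but returns len(text) on a miss."""
    i = text.find(sub)
    return i if i != -1 else len(text)

text = 'blah FDA bloh supported blih'
a = find_or_len(text, 'FDA')       # found
b = find_or_len(text, 'approve')   # missing, so b == len(text)

# With len(text) as the sentinel, min() returns the real match.
print(min(a, b) == text.find('FDA'))                  # True
# With -1 as the sentinel, min() would return the miss instead.
print(min(text.find('FDA'), text.find('approve')))    # -1
```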

Caveat: The NOT class cannot reliably convert a False to True because I 
don't know what index number to use.
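
To make the caveat concrete, here is a cut-down copy of P and NOT (the
logic is the same as above, with the __str__ methods dropped) run on a
few inputs. The "true" result is just len(text) - 1, which is not a real
match position, and for an empty string it becomes -1, the same value
Suite uses for "not found".

```python
class P(object):
    def __init__(self, pat):
        self.pat = pat
    def __call__(self, text):
        ret = text.find(self.pat)
        return ret if ret != -1 else len(text)

class NOT(object):
    def __init__(self, pat):
        self.pat = pat
    def __call__(self, text):
        ret = self.pat(text)
        # A miss in the inner pattern becomes len(text) - 1 ("found");
        # a hit becomes len(text) ("not found").
        return ret - 1 if ret == len(text) else len(text)

not_fda = NOT(P('FDA'))
print(not_fda('blah'))   # 3: counts as true, but 3 is an arbitrary index
print(not_fda(''))       # -1: collides with Suite's "not found" value
print(not_fda('FDA'))    # 3 == len('FDA'): counts as false
```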

The code is written to save vertical space, so it is not the most
readable in the world.

No guarantee that it is bug-free.

Idea:
Override the operators on the Pattern class so that we could write:
P("Hello") & P("World") instead of AND(P("Hello"), P("World"))
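
A minimal sketch of that idea, under a few assumptions of mine: & maps to
AND, | to OR and ^ to XOR, and Logical is reshaped to take the operator
name and function as arguments so the three subclasses are not needed.
NOT and Suite are left out to keep it short, so results use the
len(text) sentinel directly.

```python
class Pattern(object):
    # Assumed mapping: & -> AND, | -> OR, ^ -> XOR.
    def __and__(self, other):
        return Logical('AND', lambda a, b: a and b, self, other)
    def __or__(self, other):
        return Logical('OR', lambda a, b: a or b, self, other)
    def __xor__(self, other):
        return Logical('XOR', lambda a, b: a != b, self, other)

class Logical(Pattern):
    def __init__(self, op_name, op, pat1, pat2):
        self.op_name, self.op = op_name, op
        self.pat1, self.pat2 = pat1, pat2
    def __call__(self, text):
        a, b = self.pat1(text), self.pat2(text)
        if self.op(a != len(text), b != len(text)):
            return min(a, b)
        return len(text)   # sentinel: not found
    def __str__(self):
        return '(%s %s %s)' % (self.pat1, self.op_name, self.pat2)

class P(Pattern):
    def __init__(self, pat):
        self.pat = pat
    def __call__(self, text):
        ret = text.find(self.pat)
        return ret if ret != -1 else len(text)
    def __str__(self):
        return '"%s"' % self.pat

expr = P('FDA') & (P('approve') | P('supported'))
print(expr)          # ("FDA" AND ("approve" OR "supported"))
text = 'FDA meeting was successful. New drug is approved.'
print(expr(text))    # 0
print(expr('no match here') == len('no match here'))   # True
```

Note that & and | bind tighter than comparison operators in Python, so
mixed expressions may still need explicit parentheses.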

> Example 02:
> search criteria: "Ben"
> The catch is that the code should find only the exact word Ben, not
> words that merely start with the three letters Ben, such as Benquick,
> Benseek etc. Only Ben is the word we are looking for.

The second one was easier...

import re
def myfind(pattern, text):
     pattern = r'(.*?)\b(%s)\b(.*)' % pattern
     m = re.match(pattern, text)
     if m:
         return len(m.group(1))

textfound = 'This is a Ben test string'
texttrick = 'This is a Benquick Benseek McBen QuickBenSeek string'
textnotfound = 'He is away'
textmulti = 'Our Ben found another Ben which is quite odd'
pat = 'Ben'
print(myfind(pat, textfound))    # 10
print(myfind(pat, texttrick))    # None
print(myfind(pat, textnotfound)) # None
print(myfind(pat, textmulti))    # 4

If you only want to test for existence, simply:

pattern = 'Ben'
if re.match(r'(.*?)\b(%s)\b(.*)' % pattern, text):
     pass
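
An alternative sketch for the existence test, using re.search() so the
leading and trailing groups are not needed. The helper name contains_word
is made up, and re.escape() is added on the assumption that the search
term is plain text rather than a regular expression. Unlike the
re.match() version, this also finds the word after a newline, since '.'
does not match newlines by default.

```python
import re

def contains_word(word, text):
    """Hypothetical helper: True if `word` occurs as a whole word in `text`."""
    # re.escape() makes the word literal; re.search() scans the whole string.
    return re.search(r'\b%s\b' % re.escape(word), text) is not None

print(contains_word('Ben', 'This is a Ben test string'))     # True
print(contains_word('Ben', 'Benquick Benseek McBen'))        # False
print(contains_word('Ben', 'first line\nBen on line two'))   # True
```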

> I would really appreciate your advice (code samples or links) on how the
> above can be achieved! If possible, I would appreciate a solution that
> uses a free-of-charge module.

Standard library is free of charge, no?


