Hopefully simple regular expression question

John Machin sjmachin at lexicon.net
Tue Jun 14 08:15:20 EDT 2005


peterbe at gmail.com wrote:
> I want to match a word against a string such that 'peter' is found in
> "peter bengtsson" or " hey peter," but not in "thepeter bengtsson" or
> "hey peterbe," because the word has to stand on its own. The following
> code works for a single word:
> 
> def createStandaloneWordRegex(word):
>     """ return a regular expression that can find 'peter' only if it's
>     written alone (next to space, start of string, end of string, comma,
>     etc) but not if inside another word like peterbe """
>     return re.compile(r"""
>       (
>       ^ %s
>       (?=\W | $)
>       |
>       (?<=\W)
>       %s
>       (?=\W | $)
>       )
>       """% (word, word), re.I|re.L|re.M|re.X)
> 
> 
> def test_createStandaloneWordRegex():
>     def T(word, text):
>         print createStandaloneWordRegex(word).findall(text)
> 
>     T("peter", "So Peter Bengtsson wrote this")
>     T("peter", "peter")
>     T("peter bengtsson", "So Peter Bengtsson wrote this")
> 
> The result of running this is::
> 
>  ['Peter']
>  ['peter']
>  []   <--- this is the problem!!
> 
> 
> It works if the parameter is just one word (eg. 'peter') but stops
> working when it's an expression (eg. 'peter bengtsson')

No, not when it's an "expression" (whatever that means), but when the 
parameter contains whitespace, which is ignored in verbose mode.
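
To see the effect in isolation, here is a minimal sketch that compiles the same pattern text with and without re.X. The names plain and verbose are made up for this illustration, and the code assumes Python 3 (hence print() as a function).

```python
import re

# The same pattern text, compiled without and with verbose mode.
plain = re.compile("peter bengtsson", re.I)
verbose = re.compile("peter bengtsson", re.I | re.X)

text = "So Peter Bengtsson wrote this"

# Without re.X the space is a literal character that must be matched.
print(plain.search(text) is not None)

# With re.X the unescaped space is dropped from the pattern, so it
# behaves like "peterbengtsson" and no longer matches the text.
print(verbose.search(text) is not None)
print(verbose.search("PeterBengtsson") is not None)
```

The second search finds nothing, which is the same failure as the empty list in the quoted test output.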

> 
> How do I modify my regular expression to match on expressions as well
> as just single words??
> 

If you must stick with re.X, you must escape any whitespace characters 
in your "word" -- see re.escape().
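
A rough sketch of the re.escape() route, assuming Python 3. The function name standalone_word_regex is hypothetical, and re.L is left out on purpose because Python 3 rejects that flag for str patterns; otherwise the pattern is the one from the question.

```python
import re

def standalone_word_regex(word):
    """Verbose-mode pattern from the question, with the word escaped first."""
    # re.escape() backslash-escapes whitespace (and regex metacharacters),
    # so the spaces inside `word` survive re.X.
    escaped = re.escape(word)
    return re.compile(r"""
        (
        ^ %s
        (?=\W | $)
        |
        (?<=\W)
        %s
        (?=\W | $)
        )
        """ % (escaped, escaped), re.I | re.M | re.X)

def show(word, text):
    print(standalone_word_regex(word).findall(text))

show("peter", "So Peter Bengtsson wrote this")
show("peter", "peter")
show("peter bengtsson", "So Peter Bengtsson wrote this")
```

As a side effect, escaping also makes any regex metacharacters in the word match literally, which is probably what you want for a word taken from user input.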

Alternatively (1), drop re.X, but this is ugly:

regex_text_no_X = r"(^%s(?=\W|$)|(?<=\W)%s(?=\W|$))" % (word, word)

Alternatively (2), consider using the \b gadget; this appears to give 
the same answers as the baroque method:

regex_text_no_flab = r"\b%s\b" % word
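
A quick way to check the "same answers" claim is to run both patterns over the strings from the question. This is only a sketch, assuming Python 3: the helper names no_x_regex and boundary_regex are made up, the flags re.I and re.M are carried over from the original code, and the re.escape() call is an addition so that metacharacters in the word are taken literally.

```python
import re

def no_x_regex(word):
    # Non-verbose form of the lookahead/lookbehind pattern.
    w = re.escape(word)
    return re.compile(r"(^%s(?=\W|$)|(?<=\W)%s(?=\W|$))" % (w, w), re.I | re.M)

def boundary_regex(word):
    # Short form using the \b word-boundary assertion.
    return re.compile(r"\b%s\b" % re.escape(word), re.I)

cases = [
    ("peter", "peter bengtsson"),
    ("peter", " hey peter,"),
    ("peter", "thepeter bengtsson"),
    ("peter", "hey peterbe,"),
    ("peter bengtsson", "So Peter Bengtsson wrote this"),
]

for word, text in cases:
    long_form = [m.group(0) for m in no_x_regex(word).finditer(text)]
    short_form = boundary_regex(word).findall(text)
    print(repr(word), repr(text), long_form, short_form)
```

On these inputs the two agree. They can diverge when the word itself starts or ends with a non-word character (for example "c++"), because \b needs a word character on one side of the boundary.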


HTH,
John