Hopefully simple regular expression question
John Machin
sjmachin at lexicon.net
Tue Jun 14 08:15:20 EDT 2005
peterbe at gmail.com wrote:
> I want to match a word against a string such that 'peter' is found in
> "peter bengtsson" or " hey peter," but not in "thepeter bengtsson" or
> "hey peterbe," because the word has to stand on its own. The following
> code works for a single word:
>
> def createStandaloneWordRegex(word):
>     """ return a regular expression that can find 'peter' only if it's
>     written alone (next to space, start of string, end of string, comma,
>     etc) but not if inside another word like peterbe """
>     return re.compile(r"""
>         (
>         ^ %s
>         (?=\W | $)
>         |
>         (?<=\W)
>         %s
>         (?=\W | $)
>         )
>         """% (word, word), re.I|re.L|re.M|re.X)
>
>
> def test_createStandaloneWordRegex():
>     def T(word, text):
>         print createStandaloneWordRegex(word).findall(text)
>
>     T("peter", "So Peter Bengtsson wrote this")
>     T("peter", "peter")
>     T("peter bengtsson", "So Peter Bengtsson wrote this")
>
> The result of running this is::
>
> ['Peter']
> ['peter']
> [] <--- this is the problem!!
>
>
> It works if the parameter is just one word (eg. 'peter') but stops
> working when it's an expression (eg. 'peter bengtsson')
No, not when it's an "expression" (whatever that means), but when the
parameter contains whitespace, which is ignored in verbose mode.
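
A minimal sketch of the effect, assuming Python 3 syntax rather than the 2005-era interpreter. The variable names are made up for illustration. Under re.X the unescaped space is discarded, so the pattern behaves as if it were "peterbengtsson":

```python
import re

text = "So Peter Bengtsson wrote this"

# Same pattern text, compiled with and without verbose mode.
verbose = re.compile(r"peter bengtsson", re.I | re.X)
plain = re.compile(r"peter bengtsson", re.I)

verbose_hits = verbose.findall(text)             # space in the pattern was ignored
plain_hits = plain.findall(text)                 # space in the pattern is literal
joined_hits = verbose.findall("peterbengtsson")  # what the verbose pattern really matches

print(verbose_hits, plain_hits, joined_hits)
```
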
>
> How do I modify my regular expression to match on expressions as well
> as just single words??
>
If you must stick with re.X, you must escape any whitespace characters
in your "word" -- see re.escape().
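
Something along these lines should do it. This is only a sketch: standalone_word_regex is a hypothetical name, and it assumes Python 3, so re.L is left out because it cannot be combined with a str pattern there. re.escape() puts a backslash in front of the space, and an escaped space survives verbose mode:

```python
import re

def standalone_word_regex(word):
    # Escape the word so that whitespace (and any regex metacharacters)
    # in it are matched literally, even under re.X.
    escaped = re.escape(word)
    return re.compile(r"""
        (
            ^ %s (?=\W | $)          # at the start of a line
          |
            (?<=\W) %s (?=\W | $)    # or preceded by a non-word character
        )
        """ % (escaped, escaped), re.I | re.M | re.X)

print(standalone_word_regex("peter bengtsson").findall("So Peter Bengtsson wrote this"))
print(standalone_word_regex("peter").findall("hey peterbe,"))
```
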
Alternatively (1), drop re.X but this is ugly:
regex_text_no_X = r"(^%s(?=\W|$)|(?<=\W)%s(?=\W|$))" % (word, word)
Alternatively (2), consider using the \b gadget; this appears to give
the same answers as the baroque method:
regex_text_no_flab = r"\b%s\b" % word
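
A quick way to try the \b version against the examples from the original post. This is a sketch with a hypothetical helper name (word_boundary_regex), written for Python 3, and it adds re.escape() on the assumption that the word should be matched literally:

```python
import re

def word_boundary_regex(word):
    # \b matches at the edge between a word character and a non-word
    # character (or the start/end of the string).
    return re.compile(r"\b%s\b" % re.escape(word), re.I)

for word, text in [
    ("peter", "So Peter Bengtsson wrote this"),
    ("peter", " hey peter,"),
    ("peter bengtsson", "So Peter Bengtsson wrote this"),
    ("peter", "thepeter bengtsson"),
    ("peter", "hey peterbe,"),
]:
    print(repr(word), repr(text), word_boundary_regex(word).findall(text))
```

One caveat: the two approaches only agree when the word starts and ends with a word character. For something like "c++", \b after the final "+" requires a word character to follow, so "I like c++ a lot" is not matched by the \b version, while the lookaround version does match it.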
HTH,
John