Hopefully simple regular expression question
sjmachin at lexicon.net
Tue Jun 14 14:15:20 CEST 2005
peterbe at gmail.com wrote:
> I want to match a word against a string such that 'peter' is found in
> "peter bengtsson" or " hey peter," but not in "thepeter bengtsson" or
> "hey peterbe," because the word has to stand on its own. The following
> code works for a single word:
> import re
>
> def createStandaloneWordRegex(word):
>     """ return a regular expression that can find 'peter' only if it's
>     alone (next to space, start of string, end of string, comma, etc)
>     not if inside another word like peterbe """
>     return re.compile(r"""
>       (
>       ^ %s
>       (?=\W | $)
>       |
>       (?<=\W)
>       %s
>       (?=\W | $)
>       )
>       """% (word, word), re.I|re.L|re.M|re.X)
>
> def test_createStandaloneWordRegex():
>     def T(word, text):
>         print createStandaloneWordRegex(word).findall(text)
>
>     T("peter", "So Peter Bengtsson wrote this")
>     T("peter", "peter")
>     T("peter bengtsson", "So Peter Bengtsson wrote this")
> The result of running this is::
>
>     ['Peter']
>     ['peter']
>     []   <--- this is the problem!!
> It works if the parameter is just one word (eg. 'peter') but stops
> working when it's an expression (eg. 'peter bengtsson')
No, not when it's an "expression" (whatever that means), but when the
parameter contains whitespace, which is ignored in verbose mode.
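
A minimal sketch of that effect, written in Python 3 syntax (an assumption; the code quoted above is Python 2). The names `verbose`, `plain` and `text` are made up for the illustration:

```python
import re

# Under re.X the unescaped space in the pattern is discarded,
# so this pattern is effectively "peterbengtsson".
verbose = re.compile(r"peter bengtsson", re.I | re.X)

# Without re.X the space is an ordinary literal character.
plain = re.compile(r"peter bengtsson", re.I)

text = "So Peter Bengtsson wrote this"
print(verbose.findall(text))              # []
print(verbose.findall("PeterBengtsson"))  # ['PeterBengtsson']
print(plain.findall(text))                # ['Peter Bengtsson']
```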
> How do I modify my regular expression to match on expressions as well
> as just single words??
If you must stick with re.X, you must escape any whitespace characters
in your "word" -- see re.escape().
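
One way this might look, as a sketch rather than a drop-in replacement. It is Python 3 syntax; the function name `standalone_word_regex_verbose` is invented here, and re.L is left out because Python 3 accepts it only with bytes patterns:

```python
import re

def standalone_word_regex_verbose(word):
    # re.escape() puts a backslash in front of the space, so re.X keeps it.
    escaped = re.escape(word)
    return re.compile(r"""
        (?: ^ | (?<=\W) )   # start of line, or preceded by a non-word character
        %s
        (?= \W | $ )        # followed by a non-word character, or end of line
        """ % escaped, re.I | re.M | re.X)

rx = standalone_word_regex_verbose("peter bengtsson")
print(rx.findall("So Peter Bengtsson wrote this"))  # ['Peter Bengtsson']
print(rx.findall("thepeter bengtsson"))             # []
```

Escaping also protects against any other regex metacharacters that happen to be in the word.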
Alternatively (1), drop re.X but this is ugly:
regex_text_no_X = r"(^%s(?=\W|$)|(?<=\W)%s(?=\W|$))" % (word, word)
Alternatively (2), consider using the \b gadget; this appears to give
the same answers as the baroque method:
regex_text_no_flab = r"\b%s\b" % word
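
A quick way to check that claim on the examples from this thread, sketched in Python 3 syntax. The helper names `baroque` and `boundary` are invented for the comparison, and re.escape() is added to both as a precaution:

```python
import re

def baroque(word):
    # Alternative (1): explicit lookarounds, no re.X.
    w = re.escape(word)
    return re.compile(r"(?:^%s(?=\W|$)|(?<=\W)%s(?=\W|$))" % (w, w), re.I | re.M)

def boundary(word):
    # Alternative (2): word-boundary assertions.
    return re.compile(r"\b%s\b" % re.escape(word), re.I)

cases = [
    ("peter", "So Peter Bengtsson wrote this"),
    ("peter", "peter"),
    ("peter bengtsson", "So Peter Bengtsson wrote this"),
    ("peter", "thepeter bengtsson"),
    ("peter", "hey peterbe,"),
]

for word, text in cases:
    a = baroque(word).findall(text)
    b = boundary(word).findall(text)
    assert a == b
    print(repr(word), repr(text), a)
```

The two agree only while the word starts and ends with a word character. For something like "c++", \b after the final "+" needs a word character next to it, so the \b version would miss "c++ rocks" where the lookaround version finds it.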