[Python-Dev] Behavior of matching backreferences

Gustavo Niemeyer niemeyer@conectiva.com
Fri, 21 Jun 2002 02:07:25 -0300


Hi everyone!

I was studying the sre module, when I came up with the following
regular expression:

re.compile("^(?P<a>a)?(?P=a)$").match("ebc").groups()

The (?P=a) backreference matches whatever was matched by the "a"
group. If "a" is optional and doesn't match, it seems to make sense
for (?P=a) to become optional as well, instead of failing. Otherwise
the regular expression above will always fail whenever the first
group doesn't match, even though that group is optional.

One could argue that to make it a valid regular expression, it should
become "^(?P<a>a)?(?P=a)?". But that's a different regular expression,
since it would match "a", while the regular expression above would
match "aa" or "", but not "a".
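
To make the distinction concrete, here is a small sketch using the
standard re module as it behaves without the patch, where a
backreference to a group that did not take part in the match fails.
The names strict, loose and matches are only illustrative.

```python
import re

# The group is optional, but the backreference to it is mandatory.
strict = re.compile(r"^(?P<a>a)?(?P=a)$")
# The backreference is made optional too, which is a different expression.
loose = re.compile(r"^(?P<a>a)?(?P=a)?")

def matches(pattern, text):
    """Return True if the pattern matches at the start of text."""
    return pattern.match(text) is not None

for text in ("", "a", "aa"):
    print(repr(text), matches(strict, text), matches(loose, text))
```

Without the patch, strict accepts only "aa". The proposal is that it
should also accept "", while still rejecting "a", which loose accepts.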

This kind of pattern is useful, for example, to match a string which
may optionally be surrounded by quotes, like shell variables. Here's
an example of such a pattern: r"^(?P<a>')?((?:\\'|[^'])*)(?P=a)$".
With the proposed behavior, this pattern matches "'a'", "\'a", "a\'a",
"'a\'a'" and all such variants, but not "'a", "a'", or "a'a".
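
For comparison, here is a sketch of the same optional-quote check
written so that it does not depend on the proposed behavior. It swaps
the backreference for a conditional group, (?(a)'), which the re module
in current Python versions supports; this is a substitute technique,
not what the patch implements, and the name QUOTED is hypothetical.
The test strings are raw strings, so \' means a literal backslash
followed by a quote.

```python
import re

# Optional opening quote, then a body made of escaped quotes or non-quote
# characters, then a closing quote that is required only if the opening
# quote (group "a") actually matched.
QUOTED = re.compile(r"^(?P<a>')?((?:\\'|[^'])*)(?(a)')$")

accepted = ["'a'", r"\'a", r"a\'a", r"'a\'a'"]
rejected = ["'a", "a'", "a'a"]

for text in accepted + rejected:
    m = QUOTED.match(text)
    # Group 2 holds the body without the surrounding quotes.
    print(repr(text), "->", m.group(2) if m else None)
```

The strings in accepted should match and the ones in rejected should
not, which mirrors the list given above for the backreference form.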

I've submitted a patch that makes this work: http://python.org/sf/571976

-- 
Gustavo Niemeyer

[ 2AAC 7928 0FBF 0299 5EB5  60E2 2253 B29A 6664 3A0C ]