Behavior of matching backreferences
Hi everyone! I was studying the sre module, when I came up with the following regular expression: re.compile("^(?P<a>a)?(?P=a)$").match("ebc").groups() The (?P=a) matches with whatever was matched by the "a" group. If "a" is optional and doesn't match, it seems to make sense that (?P=a) becomes optional as well, instead of failing. Otherwise the regular expression above will allways fail if the first group fails, even being optional. One could argue that to make it a valid regular expression, it should become "^(?P<a>a)?(?P=a)?". But that's a different regular expression, since it would match "a", while the regular expression above would match "aa" or "", but not "a". This kind of pattern is useful, for example, to match a string which could be optionally surrounded by quotes, like shell variables. Here's an example of such pattern: r"^(?P<a>')?((?:\\'|[^'])*)(?P=a)$". This pattern matches "'a'", "\'a", "a\'a", "'a\'a'" and all such variants, but not "'a", "a'", or "a'a". I've submitted a patch to make this work to http://python.org/sf/571976 -- Gustavo Niemeyer [ 2AAC 7928 0FBF 0299 5EB5 60E2 2253 B29A 6664 3A0C ]
I think the re module worked correctly. If you write your expression without the ambiguity: yours: "^(?P<a>a)?(?P=a)$" re-1a: "^((?P<a>a)(?P=a))?$" re-2a: "^(?P<a>a?)(?P=a)$" your test data ebc will does not match either 'aa' or ''. Try removing the $ so that it will match '' at the start of the string. re-1b: "^((?P<a>a)(?P=a))?" re-2b: "^(?P<a>a?)(?P=a)" I think the re-2b form is the way to deal with the optional quotes. I'm not sure a patch is needed for this. BArry -----Original Message----- From: python-dev-admin@python.org [mailto:python-dev-admin@python.org]On Behalf Of Gustavo Niemeyer Sent: 21 June 2002 06:07 To: python-dev@python.org Subject: [Python-Dev] Behavior of matching backreferences Hi everyone! I was studying the sre module, when I came up with the following regular expression: re.compile("^(?P<a>a)?(?P=a)$").match("ebc").groups() The (?P=a) matches with whatever was matched by the "a" group. If "a" is optional and doesn't match, it seems to make sense that (?P=a) becomes optional as well, instead of failing. Otherwise the regular expression above will allways fail if the first group fails, even being optional. One could argue that to make it a valid regular expression, it should become "^(?P<a>a)?(?P=a)?". But that's a different regular expression, since it would match "a", while the regular expression above would match "aa" or "", but not "a". This kind of pattern is useful, for example, to match a string which could be optionally surrounded by quotes, like shell variables. Here's an example of such pattern: r"^(?P<a>')?((?:\\'|[^'])*)(?P=a)$". This pattern matches "'a'", "\'a", "a\'a", "'a\'a'" and all such variants, but not "'a", "a'", or "a'a". I've submitted a patch to make this work to http://python.org/sf/571976 -- Gustavo Niemeyer [ 2AAC 7928 0FBF 0299 5EB5 60E2 2253 B29A 6664 3A0C ] _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev
I think the re module worked correctly.
If you write your expression without the ambiguity:
I must confess I see no ambiguity in my expression.
yours: "^(?P<a>a)?(?P=a)$" re-1a: "^((?P<a>a)(?P=a))?$"
Using "aa" was just an example, of course. If I wanted to match "aa" or "", I wouldn't use this at all.
re-2a: "^(?P<a>a?)(?P=a)$"
your test data ebc will does not match either 'aa' or ''. Try removing the $ so that it will match '' at the start of the string.
Sorry, I took the wrong test to paste into the message.
re-1b: "^((?P<a>a)(?P=a))?" re-2b: "^(?P<a>a?)(?P=a)"
I think the re-2b form is the way to deal with the optional quotes.
I'm not sure a patch is needed for this.
If you think about a match with more characters, you'll end up in something like "^(?P<a>(abc)?)(?P=a)", instead of "^(?P<a>abc)?(?P=a)". Besides having a little difference in their meanings (the first m.group(1) is '', and the second is None), it looks like you're workarounding an existant problem, but you may argue that this opinion is something personal. Thus, my main point here is that using the second regular expression will never work as expected, and there is no point in not fixing it, if that's possible and has already been done. If you find an example where it *should* fail, working as it is now, I promiss I'll shut up, and withdraw myself. :-) -- Gustavo Niemeyer [ 2AAC 7928 0FBF 0299 5EB5 60E2 2253 B29A 6664 3A0C ]
I think your re has a bug in it that in python would be if cond: a = 1 print a python will give an error is cond is false. An re that defines a group conditionally as yours does I think is the same programming error. That's the ambiguity I am referring to, is or is not the named group defined?
If you think about a match with more characters, you'll end up in something like "^(?P<a>(abc)?)(?P=a)", instead of "^(?P<a>abc)?(?P=a)". Besides having a little difference in their meanings (the first m.group(1) is '', and the second is None), it looks like you're workarounding an existant problem, but you may argue that this opinion is something personal.
You can prevent groups being remember using the (?:...) syntax if you need to preserve the group index. So you need: "^(?P<a>(?:abc)?)(?P=a)" I'm not convinced you have found a bug in the engine that needs fixing, I think its your re needs changing. I want the re engine to report the error for re that are illogical. BArry
I think your re has a bug in it that in python would be
if cond: a = 1 print a
python will give an error is cond is false.
An re that defines a group conditionally as yours does I think is the same programming error. That's the ambiguity I am referring to, is or is not the named group defined?
Sorry Barry, but I don't see your point here. There's no change in the naming semantics. In sre that's totally valid and used in a lot of code:
`re.compile("(?P<a>a)?").match("b").group("a")` 'None' `re.compile("(?P<a>a)?").match("a").group("a")` "'a'"
[...]
You can prevent groups being remember using the (?:...) syntax if you need to preserve the group index. So you need:
"^(?P<a>(?:abc)?)(?P=a)"
Again, you may do regular expressions in many ways, the point I'm still raising is that there's one way that doesn't work as expected.
I'm not convinced you have found a bug in the engine that needs fixing, I think its your re needs changing. I want the re engine to report the error for re that are illogical.
The re won't report anything when somebody uses this syntax. It will just don't work as expected. If you think this re is illogical, don't use it. But I see no point in denying others to use it. I'm not planning to discuss much more about this. My intentions and the issue are clear enough. I'd like to hear the opinion of Fredrik about this, though. -- Gustavo Niemeyer [ 2AAC 7928 0FBF 0299 5EB5 60E2 2253 B29A 6664 3A0C ]
On Sun, 23 Jun 2002, Gustavo Niemeyer wrote: I do not agree with both of you. I think, re should give an error at compile time (as it does in cases, like (?<=REGEXP), where only fixed length is allowed:
re.compile("(?<=R*)") Traceback (most recent call last): File "<stdin>", line 1, in ? File "/usr/lib/python2.2/sre.py", line 178, in compile return _compile(pattern, flags) File "/usr/lib/python2.2/sre.py", line 228, in _compile raise error, v # invalid expression sre_constants.error: look-behind requires fixed-width pattern
Why? Because there is no sense in matching non-existent group. It's simply incorrect. So, instead of having time-bombs Gustavo found, it's better to check at re compile time.
Sorry Barry, but I don't see your point here. There's no change in the naming semantics. In sre that's totally valid and used in a lot of code:
`re.compile("(?P<a>a)?").match("b").group("a")` 'None' `re.compile("(?P<a>a)?").match("a").group("a")` "'a'"
This is quite different. None has a sense of meta-value which indicates that group was not used while matching. There is no way to use it in the re consistently. (well, probably some syntax could be invented for it, like 'match only if exists', etc. But it is too subtle and is hardly needed). Sincerely yours, Roman Suzi -- rnd@onego.ru =\= My AI powered by Linux RedHat 7.2
[Gustavo Niemeyer, on the behavior of re.compile("^(?P<a>a)?(?P=a)$").match("ebc").groups() ] Python and Perl work exactly the same way for the equivalent (but spellable in Perl) regexp ^(a)?\1$ matching the two strings a and aa and nothing else. That's what I expected. You didn't give a concrete example of what you think it should do instead. It may have been your intent to say that you believe the regexp *should* match the string ebc but you didn't really say so one way or the other. Regardless, neither Python nor Perl do match ebc in this case, and that's intended. The Rule, in vague English, is that a backreference matches the same text as was matched by the referenced group; if the referenced group didn't match any text, then the backreference can't match either. Note that whether the referenced group matched any text is a different question than whether the referenced group is *used* in the match. This is a subtle point I suspect you're missing.
Otherwise the regular expression above will allways fail if the first group fails,
Yes.
even being optional
There's no such beast as "an optional group". The ^(a) part *must* match or the entire regexp fails, period, regardless of whether or not backreferences appear later. The question mark following doesn't change this requirement. (a)? says 'a' must match but the overall pattern can choose to use this match or not That's why the regexp as a whole matches the string a The (a) part does match 'a', the ? chooses not to use this match, and then the backreference matches the 'a' that the first group matched. Study the output of this and it may be clearer: import re pat = re.compile(r"^((a)?)(\2)$") print pat.match('a').groups() print pat.match('aa').groups()
... while the regular expression above would match "aa" or "", but not "a".
As above, Python and Perl disagree with you: they match "aa" and "a" but not "".
... My intentions and the issue are clear enough.
Sorry, your intentions weren't clear to me. The issue is, though <wink>.
[Tim]
.. There's no such beast as "an optional group". The
^(a)
part *must* match or the entire regexp fails, period, regardless of whether or not backreferences appear later. The question mark following doesn't change this requirement. ...
Wow, yesterday's drugs haven't worn off yet <wink>. The details of this explanation were partly full of beans. Let's consider a different regexp: ^(a)?b\1$ Should that match b or not? Python and Perl say "no" today, because \1 refers to a group that didn't match. Ir remains unclear to me whether Gustavo is saying it should, but, if he is, that's too big a change, and ^(a?)b\1$ is the intended way to spell it.
Hello Tim!
Wow, yesterday's drugs haven't worn off yet <wink>. The details of this explanation were partly full of beans. Let's consider a different regexp: [...]
Thanks for explaining again, in a way I could understand. :-)
^(a)?b\1$
Should that match
b
or not? Python and Perl say "no" today, because \1 refers to a group that didn't match. Ir remains unclear to me whether Gustavo is saying it should, but, if he is, that's too big a change, and
^(a?)b\1$
[...] I still think it should, because otherwise the "^(a)?b\1$" can never be used, and this expression will become "^((a)?)b\1$" if more than one character is needed. But since nobody agrees with me, and both languages are doing it that way, I give up. :-) Could you please reject the patch at SF? Thank you! -- Gustavo Niemeyer [ 2AAC 7928 0FBF 0299 5EB5 60E2 2253 B29A 6664 3A0C ]
[Gustavo Niemeyer]
I still think it should, because otherwise the "^(a)?b\1$" can never be used, and this expression will become "^((a)?)b\1$" if more than one character is needed.
Is that a real concern? I mean that in the sense of whether you have an actual application requiring that some multi-character bracketing string either does or doesn't appear on both ends of a thing, and typing another set of parens is a burden. Both parts of that seem strained.
But since nobody agrees with me, and both languages are doing it that way, I give up. :-)
That's wise <wink>. It's not just Python and Perl, I expect you're going to find this in every careful regexp package. There's a painful discussion buried here: <http://standards.ieee.org/reading/ieee/interp/1003-2-92_int/pasc-1003.2-43. html> wherein the POSIX committee debated their own ambiguous wording about backreferences. Their specific example is: what should the regexp (in Python notation, not POSIX) ^((.)*\2#)* match in xx#yy## ? Your example is hiding in there, on the "third iteration of the outer loop". The official POSIX interpretation was that it should match just the first 6 characters, and not the trailing #, because in a third iteration of the outer subexpression, . would match nothing (as distinct from matching a null string) and hence \2 would match nothing. Python and Perl agree, which wouldn't surprise you if you first implemented a regexp engine with stinking backreferences <0.9 wink>. The distinction between "matched an empty string" and "didn't match anything" is night-&-day inside an engine, and people skating on the edge (meaning using backreferences at all!) quickly rely on the exact behavior this implies.
Could you please reject the patch at SF?
I'm not sure which one you mean, so on your authority I'm going to reject all patches at SF. Whew! This makes our job much easier <wink>.
I still think it should, because otherwise the "^(a)?b\1$" can never be used, and this expression will become "^((a)?)b\1$" if more than one character is needed.
Is that a real concern? I mean that in the sense of whether you have an actual application requiring that some multi-character bracketing string either does or doesn't appear on both ends of a thing, and typing another set of parens is a burden. Both parts of that seem strained.
No, it isn't. Even because there is some way to implement this, as Barry and you have shown, and because *I* know it doesn't work as I'd expect. :-)) Indeed, I've found it while implementing another feature which in my opinion is really useful, and can't be easily achieved. But that's something for another thread, another day. [...]
? Your example is hiding in there, on the "third iteration of the outer loop". The official POSIX interpretation was that it should match just the first 6 characters, and not the trailing #,
because in a third iteration of the outer subexpression, . would match nothing (as distinct from matching a null string) and hence \2 would match nothing. [...]
Thanks for giving me a strong and detailed reason. I understand that small issues can end up in endless discussions and different implementations. I'm happy that the POSIX people thought about that before me <2.0 wink>.
Could you please reject the patch at SF?
I'm not sure which one you mean, so on your authority I'm going to reject all patches at SF. Whew! This makes our job much easier <wink>.
That's good! You'll take back the time you wasted with me. ;-)) -- Gustavo Niemeyer [ 2AAC 7928 0FBF 0299 5EB5 60E2 2253 B29A 6664 3A0C ]
participants (4)
-
Barry Scott
-
Gustavo Niemeyer
-
Roman Suzi
-
Tim Peters