[Python-Dev] Behavior of matching backreferences

Sun, 23 Jun 2002 22:24:42 -0400

[Gustavo Niemeyer]
> I still think it should, because otherwise the "^(a)?b\1$" can never be
> used, and this expression will become "^((a)?)b\1$" if more than one
> character is needed.

Is that a real concern?  I mean that in the sense of whether you have an
actual application requiring that some multi-character bracketing string
either does or doesn't appear on both ends of a thing, and typing another
set of parens is a burden.  Both parts of that seem strained.

> But since nobody agrees with me, and both languages are doing it that
> way, I give up. :-)

That's wise <wink>.  It's not just Python and Perl; I expect you're going to
find this in every careful regexp package.  There's a painful discussion
buried here:

<http://standards.ieee.org/reading/ieee/interp/1003-2-92_int/pasc-1003.2-43.html>

wherein the POSIX committee debated their own ambiguous wording about
backreferences.  Their specific example is:  what should the regexp (in
Python notation, not POSIX)

    ^((.)*\2#)*

match in

    xx#yy##

?  Your example is hiding in there, on the "third iteration of the outer
loop".  The official POSIX interpretation was that it should match just the
first 6 characters, and not the trailing #,

    because in a third iteration of the outer subexpression, . would match
    nothing (as distinct from matching a null string) and hence \2 would
    match nothing.

Python and Perl agree, which wouldn't surprise you if you had first
implemented a regexp engine with stinking backreferences <0.9 wink>.  The
distinction between "matched an empty string" and "didn't match anything" is
night-&-day inside an engine, and people skating on the edge (meaning using
backreferences at all!) quickly come to rely on the exact behavior this
implies.
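
A minimal sketch of that distinction, assuming a reasonably current CPython
`re` module.  The variable names are illustrative and not from the thread;
the patterns and the test string are the ones quoted above.

```python
import re

# Gustavo's pattern: group 1 is optional, so on "b" it never participates
# in the match, and the backreference \1 then fails outright.
optional_group = re.compile(r'^(a)?b\1$')

# The extra-parens form: the outer group always participates, possibly
# matching the empty string, so \1 can succeed by matching "".
always_group = re.compile(r'^((a)?)b\1$')

print(optional_group.match('aba') is not None)  # True: group 1 matched "a"
print(optional_group.match('b') is not None)    # False: group 1 matched nothing
print(always_group.match('b') is not None)      # True: group 1 matched ""

# The POSIX committee's example, in Python notation.
m = re.match(r'^((.)*\2#)*', 'xx#yy##')
print(repr(m.group(0)))                         # 'xx#yy#': trailing # left out
```

The last line shows the 6-character match described in the POSIX
interpretation: the final # cannot be consumed, because no iteration of the
outer group can make \2 match there.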

> Could you please reject the patch at SF?

I'm not sure which one you mean, so on your authority I'm going to reject
all patches at SF.  Whew!  This makes our job much easier <wink>.