can anyone tell me how Perl treats this pattern?
r'((((((((((a))))))))))\41'
in SRE, this is currently ten nested groups, surrounding a single literal, followed by a back reference to the fourth group, followed by a literal "1" (since there are fewer than 41 groups)
in PRE, it turns out that this is a syntax error; there's no group 41.
however, this test appears in the test suite under the section "all test from perl", but they're commented out:
# Python does not have the same rules for \41 so this is a syntax error
#    ('((((((((((a))))))))))\41', 'aa', FAIL),
#    ('((((((((((a))))))))))\41', 'a!', SUCCEED, 'found', 'a!'),
if I understand this correctly, Perl treats this as an *octal* escape (chr(041) == "!").
now, should I emulate PRE, Perl, or leave it as it is...
</F>
PS. in case anyone wondered why I haven't seen this before, it's because I just discovered that the test suite masks syntax errors under some circumstances...
On Thu, Aug 31, 2000 at 09:46:54PM +0200, Fredrik Lundh wrote:
can anyone tell me how Perl treats this pattern? r'((((((((((a))))))))))\41'
if I understand this correctly, Perl treats this as an *octal* escape (chr(041) == "!").
Correct. From perlre:
You may have as many parentheses as you wish. If you have more than 9 substrings, the variables $10, $11, ... refer to the corresponding substring. Within the pattern, \10, \11, etc. refer back to substrings if there have been at least that many left parentheses before the backreference. Otherwise (for backward compatibility) \10 is the same as \010, a backspace, and \11 the same as \011, a tab. And so on. (\1 through \9 are always backreferences.)
In other words, if there were 41 groups, \41 would be a backref to group 41; if there aren't, it's an octal escape. This magical behaviour was deemed not Pythonic, so pre uses a different rule: it's always a character inside a character class ([\41] isn't a syntax error), and outside a character class it's a character if there are exactly 3 octal digits; otherwise it's a backref. So \41 is a backref to group 41, but \041 is the literal character ASCII 33.
--amk
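The rule described above can be tried against the `re` module in a current Python 3, which, as far as I can tell, still behaves this way. This is only a sketch added for illustration, not something from the thread; the helper name `compiles` and the constant `NESTED` are made up here.

```python
import re

NESTED = r'((((((((((a))))))))))'  # ten nested groups around a literal "a"


def compiles(pattern):
    """Return True if `pattern` compiles, False on a regex syntax error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Two digits outside a class: read as a backreference to group 41,
# which does not exist, so compilation fails.
print(compiles(NESTED + r'\41'))

# Leading zero, three octal digits: the character with octal value 041 ("!").
print(re.fullmatch(NESTED + r'\041', 'a!') is not None)

# Inside a character class a numeric escape is always a character.
print(re.fullmatch(r'[\41]', '!') is not None)
```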
amk wrote:
outside a character class it's a character if there are exactly 3 octal digits; otherwise it's a backref. So \41 is a backref to group 41, but \041 is the literal character ASCII 33.
so what's the right way to parse this?
read up to three digits, check if they're a valid octal number, and treat them as a decimal group number if not?
</F>
amk wrote:
outside a character class it's a character if there are exactly 3 octal digits; otherwise it's a backref. So \41 is a backref to group 41, but \041 is the literal character ASCII 33.
so what's the right way to parse this?
read up to three digits, check if they're a valid octal number, and treat them as a decimal group number if not?
Suggestion:
If there are fewer than 3 digits, it's a group.
If there are exactly 3 digits and you have 100 or more groups, it's a group -- too bad, you lose octal number support. Use \x. :-)
If there are exactly 3 digits and you have at most 99 groups, it's an octal escape.
(Can you even have more than 99 groups in SRE?)
--Guido van Rossum (home page: http://www.pythonlabs.com/%7Eguido/)
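A minimal sketch of the three suggested rules, assuming the digits after the backslash have already been collected and the total number of groups is known. The function name `classify` and its return shape are hypothetical, chosen only for this illustration.

```python
def classify(digits, group_count):
    """Classify a run of 1 to 3 digits following a backslash.

    Returns ('group', number) or ('octal', character).
    """
    # Fewer than 3 digits: always a group reference.
    if len(digits) < 3:
        return ('group', int(digits))
    # Exactly 3 digits and 100 or more groups: still a group reference.
    if group_count >= 100:
        return ('group', int(digits))
    # Exactly 3 digits and at most 99 groups: an octal escape.
    # The suggestion does not say what to do with a non-octal digit such
    # as "9"; here int() simply raises ValueError in that case.
    return ('octal', chr(int(digits, 8)))


print(classify('41', 10))    # two digits, so a group reference
print(classify('041', 10))   # three digits, few groups, so octal
print(classify('041', 100))  # three digits, many groups, so a group
```

Note that three digits containing an 8 or 9 are not covered by the rules as stated, which is why the sketch lets the conversion fail rather than guess.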
guido wrote:
Suggestion:
If there are fewer than 3 digits, it's a group.
If there are exactly 3 digits and you have 100 or more groups, it's a group -- too bad, you lose octal number support. Use \x. :-)
If there are exactly 3 digits and you have at most 99 groups, it's an octal escape.
I had to add one rule:
If it starts with a zero, it's always an octal number. Up to two more octal digits are accepted after the leading zero.
but this still fails on this pattern:
r'(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\119'
where the last part is supposed to be a reference to group 11, followed by a literal '9'.
more ideas?
(Can you even have more than 99 groups in SRE?)
yes -- the current limit is 100 groups. but that's an artificial limit, and it should be removed.
</F>
On Thu, 31 Aug 2000, Fredrik Lundh wrote:
I had to add one rule:
If it starts with a zero, it's always an octal number. Up to two more octal digits are accepted after the leading zero.
Fewer rules are better. Let's not arbitrarily rule out the possibility of more than 100 groups.
The octal escapes are a different kind of animal than the backreferences: for a backreference, there is *actually* a backslash followed by a number in the regular expression; but we already have a reasonable way to put funny characters into regular expressions.
That is, i propose *removing* the translation of octal escapes from the regular expression engine. That's the job of the string literal:
r'\011' is a backreference to group 11
'\\011' is a backreference to group 11
'\011' is a tab character
This makes automatic construction of regular expressions a tractable problem. We don't want to introduce so many exceptional cases that an attempt to automatically build regular expressions will turn into a nightmare of special cases.
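The three spellings above differ only in what the Python string literal hands to the regular expression engine. A short check of that, using plain string operations and nothing from `re` (the variable names are just for illustration):

```python
raw = r'\011'      # backslash, "0", "1", "1"
escaped = '\\011'  # the same four characters, written without the r prefix
literal = '\011'   # the string literal already turned this into one character

# The first two reach the engine as a backslash followed by digits.
print(raw == escaped, len(raw))

# The third reaches the engine as a single tab character.
print(literal == '\t', len(literal))
```

Under the proposal, the engine would only ever have to interpret the first form, and only as a backreference.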
-- ?!ng
[/F]
I had to add one rule:
If it starts with a zero, it's always an octal number. Up to two more octal digits are accepted after the leading zero.
but this still fails on this pattern:
r'(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\119'
where the last part is supposed to be a reference to group 11, followed by a literal '9'.
But 9 isn't an octal digit, so it fits w/ your new rule just fine. \117 here instead would be an octal escape.
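The difference between \119 and \117 can be observed with the `re` module in a current Python 3, on the assumption (mine, not the thread's) that it resolves these escapes the way described here. The constant `TWELVE` is a name made up for this sketch.

```python
import re

TWELVE = r'(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)'  # twelve groups; group 11 is "k"

# \119: "9" is not an octal digit, so this reads as a reference to
# group 11 followed by a literal "9".
m = re.fullmatch(TWELVE + r'\119', 'abcdefghijkl' + 'k' + '9')
print(m is not None)

# \117: three octal digits, so this is the character chr(0o117), i.e. "O".
n = re.fullmatch(TWELVE + r'\117', 'abcdefghijkl' + 'O')
print(n is not None)
```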
tim:
[/F]
I had to add one rule:
If it starts with a zero, it's always an octal number. Up to two more octal digits are accepted after the leading zero.
but this still fails on this pattern:
r'(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\119'
where the last part is supposed to be a reference to group 11, followed by a literal '9'.
But 9 isn't an octal digit, so it fits w/ your new rule just fine.
last time I checked, "1" wasn't a valid zero.
but never mind; I think I've figured it out (see other mail)
</F>
Suggestion:
If there are fewer than 3 digits, it's a group.
Unless it begins with a 0 (that's what's documented today -- read the docs <wink>).
If there are exactly 3 digits and you have 100 or more groups, it's a group -- too bad, you lose octal number support. Use \x. :-)
The docs say you can't use backreferences for groups higher than 99.
If there are exactly 3 digits and you have at most 99 groups, it's an octal escape.
If we make the meaning depend on the number of preceding groups, we may as well emulate *all* of Perl's ugliness here.
The PRE documentation expresses the true intent:
\number Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'the end' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the "[" and "]" of a character class, all numeric escapes are treated as characters
This was discussed at length when we decided to go the Perl-compatible route, and Perl's rules for backreferences were agreed to be just too ugly to emulate. The meaning of \oo in Perl depends on how many groups precede it! In this case, there are fewer than 41 groups, so Perl says "octal escape"; but if 41 or more groups had preceded, it would mean "backreference" instead(!). Simply unbearably ugly and error-prone.
-----Original Message----- From: python-dev-admin@python.org [mailto:python-dev-admin@python.org]On Behalf Of Fredrik Lundh Sent: Thursday, August 31, 2000 3:47 PM To: python-dev@python.org Subject: [Python-Dev] one last SRE headache
tim peters:
The PRE documentation expresses the true intent:
\number Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'the end' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number.
yeah, I've read that. clear as coffee.
but looking at it again, I suppose that the right way to implement this is (doing the tests in the given order):
if it starts with zero, it's an octal escape (1 or 2 octal digits may follow)
if it starts with an octal digit, AND is followed by two other octal digits, it's an octal escape
if it starts with any digit, it's a reference (1 extra decimal digit may follow)
oh well. too bad my scanner only provides a one-character lookahead...
</F>
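The three tests, applied in the given order, can be sketched as follows. This is an illustration under my own assumptions, not SRE's actual code: the function name `parse_numeric_escape` and its return value are hypothetical, and it looks ahead by slicing the input, which sidesteps the one-character-lookahead limitation mentioned above instead of solving it.

```python
OCTAL = '01234567'
DIGITS = '0123456789'


def parse_numeric_escape(text):
    """Parse the digits at the start of `text` (the part after a backslash).

    Returns (kind, value, consumed): kind is 'octal' or 'group', value is
    the character code or group number, consumed is the digits used.
    """
    if not text or text[0] not in DIGITS:
        raise ValueError('expected a digit after the backslash')

    # Test 1: a leading zero is always an octal escape;
    # one or two more octal digits may follow.
    if text[0] == '0':
        end = 1
        while end < 3 and end < len(text) and text[end] in OCTAL:
            end += 1
        return ('octal', int(text[:end], 8), end)

    # Test 2: an octal digit followed by two more octal digits
    # is an octal escape.
    if len(text) >= 3 and all(ch in OCTAL for ch in text[:3]):
        return ('octal', int(text[:3], 8), 3)

    # Test 3: anything else starting with a digit is a group reference;
    # one extra decimal digit may follow.
    end = 2 if len(text) >= 2 and text[1] in DIGITS else 1
    return ('group', int(text[:end]), end)


for sample in ('41', '041', '119', '117'):
    print(sample, parse_numeric_escape(sample))
```

With these tests, `\119` yields group 11 and leaves the "9" to be read as a literal, while `\117` is an octal escape.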