On Sat, Nov 10, 2018 at 12:42 PM Joao S. O. Bueno <jsbueno@python.org.br> wrote:
I just saw a document which reminded me of strings with a backslash followed by 3 octal digits. When a backslash is followed by 3 octal digits, that means a character with the corresponding codepoint and all is well.
The "valid scenario":
In [42]: "\777"
Out[42]: 'ǿ'
The problem is when you have just two valid octal digits:
In [40]: "\778"
Out[40]: '?8'
Which is ambiguous at least -- why is this not "\x07" "77" for example? (Octal 77 actually corresponds to the "?" (63) character.)
Not ambiguous. It takes as many valid octal digits as it can.

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-l...

\ooo ==> Character with octal value ooo
Note 1: As in Standard C, up to three octal digits are accepted.

"Up to" means that one or two digits can also define a character. For obvious reasons, it has to take digits greedily (otherwise "\777" would be "\x07" followed by "77"), and it's not an error to have fewer digits. Permitting a single digit means that "\0" means the NUL character, which is often convenient.
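To make the greedy rule concrete, here is a minimal sketch of it in plain Python. The helper name split_octal_escape is hypothetical and is not how CPython implements it; it only mimics the documented behaviour on the text that follows the backslash.

```python
OCTAL_DIGITS = "01234567"

def split_octal_escape(body):
    """Hypothetical helper: `body` is the text right after the backslash.

    Consumes one to three octal digits, stopping at the first character
    that is not an octal digit, and returns (character, leftover).
    """
    count = 0
    while count < 3 and count < len(body) and body[count] in OCTAL_DIGITS:
        count += 1
    if count == 0:
        raise ValueError("not an octal escape")
    return chr(int(body[:count], 8)), body[count:]

print(split_octal_escape("777"))  # all three digits are consumed
print(split_octal_escape("778"))  # two digits consumed, '8' is left over
print(split_octal_escape("797"))  # one digit consumed, '97' is left over
print(split_octal_escape("0"))    # a single digit gives NUL
```

Feeding it "778" gives ('?', '8') and "797" gives ('\x07', '97'), which matches what the interpreter does with the corresponding literals.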
And then when the second digit is not valid octal:
In [43]: "\797"
Out[43]: '\x0797'
WAT?
So, between the possibly ambiguous scenario with two octal digits followed by a non-octal digit, and the completely unexpected expansion to a 4-hexadecimal digit codepoint in the last case
You may possibly be misinterpreting the last result. It's exactly the same as the previous ones.
>>> list("\797")
['\x07', '9', '7']
The octal escape grabs as many digits as it can, and when it finds a character in the literal that isn't a valid octal digit (same whether it's a '9' or a 'q'), it stops. The remaining characters have no special meaning; this does not become four hex digits. A "\xNN" escape in Python must be exactly two digits, no more and no less.
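A quick check of both points with real literals, as a sketch you can paste into an interpreter. The variable names are only for illustration.

```python
two_digits = "\778"   # \77 is '?', followed by a literal '8'
one_digit = "\797"    # \7 is BEL, followed by literal '9' and '7'

print(list(two_digits))             # ['?', '8']
print(list(one_digit))              # ['\x07', '9', '7']

# The repr '\x0797' is just "\x07" followed by "9" and "7": three characters.
print(one_digit == "\x07" + "97")   # True
print(len(one_digit))               # 3

# "\xNN" needs exactly two hex digits, so a one-digit form fails to compile.
try:
    compile(r'"\x7"', "<example>", "eval")
    short_hex_ok = True
except SyntaxError:
    short_hex_ok = False
print(short_hex_ok)                 # False
```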
what do you say of deprecating any r"\[0-9]{1,3}" sequence that doesn't match a full 3 octal digits, and yielding a syntax error for that from Python 3.9 (or 3.10) on?
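For a sense of what such a rule would flag, here is a rough sketch that finds octal escapes shorter than three digits in the raw body of a string literal. The pattern and the helper name short_octal_escapes are assumptions made for illustration, not part of the proposal or of any existing tool, and the sketch only looks at octal digits.

```python
import re

# Assumption: a "short" escape is a backslash, then one or two octal
# digits, with no further octal digit after them. The leading
# (?<!\\)(?:\\\\)* part skips pairs of escaped backslashes.
SHORT_OCTAL = re.compile(r"(?<!\\)(?:\\\\)*\\([0-7]{1,2})(?![0-7])")

def short_octal_escapes(literal_body):
    """Return the digit runs of octal escapes with fewer than 3 digits."""
    return [m.group(1) for m in SHORT_OCTAL.finditer(literal_body)]

print(short_octal_escapes(r"\777"))   # full three digits, nothing flagged
print(short_octal_escapes(r"\778"))   # flags '77'
print(short_octal_escapes(r"\797"))   # flags '7'
print(short_octal_escapes(r"\0"))     # flags '0', the common NUL spelling
print(short_octal_escapes(r"\\77"))   # escaped backslash, nothing flagged
```

Note that the very common "\0" spelling would be caught by this rule.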
Nope. Would break code for no good reason.

ChrisA