Python octal escape character encoding "wats"

I just saw a document which reminded me of how strings treat a backslash followed by octal digits.

When a backslash is followed by 3 octal digits, that means a character with the corresponding codepoint, and all is well. The "valid scenario":

In [42]: "\777"
Out[42]: 'ǿ'

The problem is when you have just two valid octal digits:

In [40]: "\778"
Out[40]: '?8'

Which is ambiguous at least -- why is this not "\x07" "77" for example? (Octal 77 actually corresponds to the "?" (63) character.)

Or... when the first digit is not valid as octal, that is:

In [41]: "\877"
Out[41]: '\\877'

And then when the second digit is not valid octal:

In [43]: "\797"
Out[43]: '\x0797'

WAT?

So, between the possibly ambiguous scenario with two octal digits followed by a non-octal digit, and the completely unexpected expansion to a 4-hexadecimal-digit codepoint in the last case, what do you say to deprecating any r"\[0-9]{1,3}" sequence that doesn't match a full 3 octal digits, and raising a syntax error for it from Python 3.9 (or 3.10) on?

Best regards,

js
-><-

On Sat, Nov 10, 2018 at 12:42 PM Joao S. O. Bueno <jsbueno@python.org.br> wrote:
Not ambiguous. It takes as many valid octal digits as it can.

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-l...

\ooo ==> Character with octal value ooo
Note 1: As in Standard C, up to three octal digits are accepted.

"Up to" means that one or two digits can also define a character. For obvious reasons, it has to take digits greedily (otherwise "\777" would be "\x07" followed by "77"), and it's not an error to have fewer digits. Permitting a single digit means that "\0" means the NUL character, which is often convenient.
You may possibly be misinterpreting the last result. It's exactly the same as the previous ones.
>>> list("\797")
['\x07', '9', '7']
The octal escape grabs as many digits as it can, and when it finds a character in the literal that isn't a valid octal digit (same whether it's a '9' or a 'q'), it stops. The remaining characters have no special meaning; this does not become four hex digits. A "\xNN" escape in Python must be exactly two digits, no more and no less.
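To make the greedy rule concrete, here is a minimal sketch that mimics it on raw strings. The helper name expand_octal_escapes is made up for illustration; this is not CPython's tokenizer, it handles only octal escapes, and it does not treat a doubled backslash specially.

```python
import re

# Hypothetical helper for illustration only; it is not how CPython is implemented.
# The quantifier {1,3} is greedy, so it takes as many octal digits as it can, up to three.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def expand_octal_escapes(raw: str) -> str:
    """Replace each backslash followed by 1 to 3 octal digits with that character."""
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), raw)


# Raw strings keep the backslash, so the helper (not the compiler) does the decoding.
print(list(expand_octal_escapes(r"\797")))  # ['\x07', '9', '7']
print(list(expand_octal_escapes(r"\778")))  # ['?', '8']
print(list(expand_octal_escapes(r"\877")))  # ['\\', '8', '7', '7']
```

The first result is the same three-character string that the interpreter displays as '\x0797': a BEL character followed by the ordinary characters '9' and '7'.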
Nope. Would break code for no good reason. ChrisA

On Fri, 9 Nov 2018 at 23:56, Chris Angelico <rosuav@gmail.com> wrote:
>>> list("\797")
['\x07', '9', '7']
Yes, I had just figured this out before going to sleep, and was coming back to say that, although strange, this is no reason for breaking stuff. Thank you for the lengthy reply!!

On Sat, Nov 10, 2018 at 12:56:07PM +1100, Chris Angelico wrote:
Not ambiguous. It takes as many valid octal digits as it can.
What is the rationale for that? Hex escapes don't. My guess is, "Because that's what C does". And C probably does it because "Dennis Ritchie wanted to minimize the number of keypresses when he was typing" :-)
In hindsight, I think we should have insisted that octal escapes must always be three digits, just as hex escapes are always two. The status quo has too much magical "Do What I Mean" in it for my liking:

py> '\509\51'  # pair of brackets surrounding a nine
'(9)'
py> '\507\51'  # pair of brackets surrounding a seven
'G)'

Dammit Python, that's not what I meant!
There's a good reason: to make the behaviour more sensible and less confusing and have fewer "oops, that's not what I wanted" bugs. But we should have made that change for 3.0. Now, I agree: it would be breakage where the benefit doesn't outweigh the cost. Maybe in Python 5000. In the meantime, one or two digit octal escapes ought to be a linter warning. -- Steve
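As a rough idea of what such a linter check could look like, here is a sketch that scans the source text of a string literal and reports octal escapes with fewer than three digits. The function name short_octal_escapes is hypothetical, and the sketch assumes it is handed the body of a non-raw literal; it does not parse Python source and is not taken from any existing linter.

```python
import re

# Matches either an escaped backslash (which is skipped) or an octal escape
# of one to three digits. Consuming "\\\\" first keeps a literal backslash
# followed by digits from being misread as an escape.
_ESCAPE = re.compile(r"\\(?:\\|([0-7]{1,3}))")


def short_octal_escapes(literal_source: str) -> list[tuple[int, str]]:
    """Return (offset, escape) pairs for octal escapes shorter than three digits."""
    findings = []
    for match in _ESCAPE.finditer(literal_source):
        digits = match.group(1)  # None when the match is an escaped backslash
        if digits is not None and len(digits) < 3:
            findings.append((match.start(), match.group(0)))
    return findings


print(short_octal_escapes(r"\509\51"))  # [(0, '\\50'), (4, '\\51')]
print(short_octal_escapes(r"\507\51"))  # [(4, '\\51')]
print(short_octal_escapes(r"\\50"))     # []
```

A real check would probably also want to decide whether "\0" on its own deserves a warning, since it is so common.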

On Sat, Nov 10, 2018 at 3:19 PM Steven D'Aprano <steve@pearwood.info> wrote:
Irrelevant to whether it's ambiguous or not.
How often do you actually do that with octal escapes, though? Ever had actual real-world situations where this comes up? I don't recall *ever* coming across a problem where sometimes I have an octal escape followed by a nine, and other times by a different digit. I also do not recall often wanting an octal escape followed by a digit, even without that confusion.
We can debate whether it would be, in the abstract, better to mandate exactly three digits, or to allow fewer. But I think we're all agreed that it is nowhere _near_ enough of a problem to justify the breakage. I perhaps exaggerated slightly in saying "no" good reason, but certainly not enough to consider the change.
Maybe. Or just have the editor colour the octal escape differently; that way, the end of the colour will tell you if the language is misinterpreting your intentions. Either way, yeah, something that tooling can help with. ChrisA

On 11/9/18 11:19 PM, Steven D'Aprano wrote:
Since the 'normal' usage for octal escapes in C (which came long before hex escapes) was to input control characters, the most likely being \0, the next most likely \33 (Escape), and by far most being in the range \0 - \37, requiring three digits all the time would be very inconvenient. You would never use the escape for a printable character and interleave it with other printable characters. Yes, if you are putting in codes for a string of arbitrary byte values using escapes, then you would likely always use 3 digits for readability, but then you don't have the ambiguity, as EVERY code is an escape. The one case where you might get the problem is a control character (like Escape) followed by a digit between 0 and 7, where you need to expand the escape to 3 digits. This was just one of the traps you learned to live with (and terminal escape codes seemed to avoid the issue by normally following the escape character with a non-digit character.) -- Richard Damon
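The trap described here carries over to Python string literals. A small sketch, using Escape (octal 33) as the control character; the constant name ESC is just for illustration:

```python
ESC = "\033"  # three-digit octal escape for the Escape control character (code 27)

# Short form followed by a non-digit: the escape ends after "33", as intended.
assert "\33[1m" == ESC + "[1m"

# Short form followed by an octal digit: the digit is absorbed into the escape,
# giving the single character with octal value 331 (0xD9) instead of ESC + "1".
assert "\331" == "\xd9"
assert len("\331") == 1

# Padding the escape to three digits stops it before the following digit.
assert "\0331" == ESC + "1"
assert len("\0331") == 2
```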
