[Python-ideas] Python octal escape character encoding "wats"

Chris Angelico rosuav at gmail.com
Fri Nov 9 20:56:07 EST 2018


On Sat, Nov 10, 2018 at 12:42 PM Joao S. O. Bueno <jsbueno at python.org.br> wrote:
>
> I just saw a document that reminded me of strings with a
> backslash followed by 3 octal digits. When a backslash is followed by
> 3 octal digits, that means a character with the corresponding
> codepoint and all is well.
>
> The "valid scenario":
>
> In [42]: "\777"
> Out[42]: 'ǿ'
>
> The problem is when you have just two valid octal digits:
>
> In [40]: "\778"
> Out[40]: '?8'
>
> Which is ambiguous at least -- why is this not "\x07" "78" for
> example?  (octal 77, i.e. 0o77, actually corresponds to the "?" (63) character)

Not ambiguous. It takes as many valid octal digits as it can.

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

\ooo ==> Character with octal value ooo
Note 1: As in Standard C, up to three octal digits are accepted.

"Up to" means that one or two digits can also define a character. For
obvious reasons, it has to take digits greedily (otherwise "\777"
would be "\x07" followed by "77"), and it's not an error to have fewer
digits. Permitting a single digit means that "\0" means the NUL
character, which is often convenient.
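As a rough sketch of the greedy rule (the literals below are just
illustrative picks, kept at or below octal 377 so the same values would
also be valid in bytes literals):

```python
# Octal escapes consume one to three octal digits, greedily.
one_digit = "\0"        # a single digit: the NUL character
two_digits = "\778"     # 0o77 is 63, i.e. "?", then a literal "8"
three_digits = "\101"   # 0o101 is 65, i.e. "A"
four_digits = "\1011"   # only three digits are taken: "A", then "1"

assert one_digit == "\x00" and len(one_digit) == 1
assert two_digits == "?8"
assert three_digits == "A"
assert four_digits == "A1"
```

The "\778" case is the one from the quoted session: two octal digits
are consumed, and the "8" is left as an ordinary character.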

> And then when the second digit is not valid octal:
> In [43]: "\797"
> Out[43]: '\x0797'
> WAT?
>
> So, between the possibly ambiguous scenario with two octal digits
> followed by a non-octal digit, and the completely unexpected expansion
> to a 4-hexadecimal digit codepoint in the last case

You may possibly be misinterpreting the last result. It's exactly the
same as the previous ones.

>>> list("\797")
['\x07', '9', '7']

The octal escape grabs as many digits as it can, and when it finds a
character in the literal that isn't a valid octal digit (same whether
it's a '9' or a 'q'), it stops. The remaining characters have no
special meaning; this does not become four hex digits. A "\xNN" escape
in Python must be exactly two digits, no more and no less.
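To make that concrete, here is a small sketch. The compiles() helper is
a hypothetical name invented for this example (it is not a standard
library function); it just reports whether a piece of source text
compiles as an expression.

```python
# The octal escape stops at the first character that is not an octal digit.
chars = list("\797")
assert chars == ["\x07", "9", "7"]
assert "\7q" == "\x07q"  # same behaviour whether it is a '9' or a 'q'

# The repr shows \x07 followed by the literal characters "97",
# which can look like a four-digit hex escape but is not one.
assert repr("\797") == r"'\x0797'"

# \xNN takes exactly two hex digits; a third digit is an ordinary character.
assert "\x419" == "A9"


def compiles(source):
    """Hypothetical helper: True if `source` compiles as an expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


assert compiles(r'"\x41"')     # two hex digits: fine
assert not compiles(r'"\x4"')  # one hex digit: rejected at compile time
```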

> what do you say
> of deprecating any r"\[0-9]{1,3}" sequence that doesn't match a full 3
> octal digits, and yielding a syntax error for that from Python 3.9 (or
> 3.10) on?

Nope. Would break code for no good reason.

ChrisA

