Deprecate misleading escapes in strings - Python-ideas

16 Feb 2023

Hi all!

I was writing a tutorial on the distinction between bytes and strings
and why it matters, when I noticed a root cause of the confusion.
People coming from C, Perl, Python 2 and similar languages often
mistake "\x90" for b"\x90".  My idea is that Python could deprecate
string literals that specify non-ASCII codepoints by any means other
than literal Unicode characters or Unicode escapes (\u, \U, \N).
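
To make the confusion concrete, here is a minimal illustration (the variable names are only for this example). It shows that "\x90" in a str literal denotes the codepoint U+0090 and that it matches the single byte 0x90 only under Latin-1 encoding.

```python
# "\x90" in a str literal is the codepoint U+0090, not the byte 0x90.
text = "\x90"
data = b"\x90"

same_as_unicode_escape = text == "\u0090"   # both literals spell U+0090
utf8_bytes = text.encode("utf-8")           # encodes to two bytes, not one
latin1_bytes = text.encode("latin-1")       # Latin-1 maps U+0090 to byte 0x90

print(same_as_unicode_escape, utf8_bytes, latin1_bytes == data)
```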

(Actually, I found that I already had this idea back in 2021 on
Stack Overflow[1].  The question is an excellent example of what I mean.)

I would not go so far as to follow JSON (disallowing \x11 and \222
escapes completely), but while writing "\x00" or "\0" is useful and
widely used, "\x99" (and especially "\777"!) is probably marginal and
definitely less explicit than "\u0099" (per the Zen of Python, explicit
is better than implicit).  Byte strings, for their part, do not treat
b"\u00ff" as b"\xff".
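
Both comparisons can be checked with the standard library. This is a small sketch, assuming a current CPython where an unrecognized escape in a bytes literal is still only a warning (silenced here) rather than an error.

```python
import ast
import json
import warnings

# JSON accepts \uXXXX for arbitrary codepoints but rejects \x escapes.
decoded = json.loads(r'"\u0011"')
try:
    json.loads(r'"\x11"')
    json_rejects_x = False
except json.JSONDecodeError:
    json_rejects_x = True

# In a bytes literal, \u is not an escape, so the backslash is kept as is.
# The literal is evaluated from text so the warning can be silenced locally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    not_a_byte = ast.literal_eval(r'b"\u00ff"')

print(decoded == "\x11", json_rejects_x, len(not_a_byte))
```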

As a first step, Python could raise a SyntaxWarning (or should it be
DeprecationWarning? BytesWarning?), suggesting that "\x99" become
either "\u0099" or b"\x99", eventually promoting it to some equally
helpful SyntaxError.  All of it could be hidden behind a feature like
from __future__ import backslashes (one nice name I can think of).
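
The check such a warning would perform can be approximated today as a lint pass. The sketch below is not how the compiler would do it; find_misleading_escapes is a hypothetical helper built on the standard tokenize module. It reports the line where the literal starts, and on Python 3.12+ it does not look inside f-strings, which are tokenized differently.

```python
import io
import re
import tokenize

# One backslash escape: \xHH, up to three octal digits, or any single character.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|[0-7]{1,3}|.)", re.DOTALL)


def find_misleading_escapes(source):
    """Return (line, escape) pairs for hex or octal escapes >= 0x80 in str literals."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        text = tok.string
        # Everything before the first quote character is the literal prefix.
        quote_at = min(i for i in (text.find("'"), text.find('"')) if i != -1)
        prefix = text[:quote_at].lower()
        if "b" in prefix or "r" in prefix:
            continue  # bytes and raw literals are outside the proposal
        for match in _ESCAPE.finditer(text, quote_at):
            body = match.group(1)
            if body[0] == "x" and len(body) == 3:
                value = int(body[1:], 16)
            elif body[0] in "01234567":
                value = int(body, 8)
            else:
                continue  # \n, \\, \u and friends are left alone
            if value >= 0x80:
                found.append((tok.start[0], match.group(0)))
    return found


sample = 'a = "\\x90"\nb = b"\\x90"\nc = "\\0\\x7f\\777"\nd = "\\\\x90"\n'
print(find_misleading_escapes(sample))
```

Matching every escape in one left-to-right pass is what keeps an escaped backslash followed by x90 (line 4 of the sample) from being reported.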

The new regular expression for octals would be \\[01]?[0-7]{1,2} and
\\x[0-7][0-9A-Fa-f] for hexadecimals, hopefully not confusing anyone,
and not much more complex than the old ones.
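
A quick way to sanity-check the two patterns is to compile them unchanged and try a few escapes. The helper name is_ascii_escape is only for this illustration.

```python
import re

# The two patterns quoted above, compiled unchanged.
OCTAL = re.compile(r"\\[01]?[0-7]{1,2}")
HEX = re.compile(r"\\x[0-7][0-9A-Fa-f]")


def is_ascii_escape(escape):
    """True if the whole escape matches one of the proposed patterns."""
    return bool(OCTAL.fullmatch(escape) or HEX.fullmatch(escape))


allowed = [r"\0", r"\4", r"\33", r"\177", r"\x00", r"\x7f"]
rejected = [r"\200", r"\777", r"\x80", r"\x99"]

for escape in allowed + rejected:
    print(escape, is_ascii_escape(escape))
```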

In the meantime, probably between introducing the warning and turning
it into an error (the most reasonable timeline I can think of now), the
default ascii() representation should switch to the \u0099 form for all
such codepoints, to keep the invariant eval(ascii(x)) == x free of
syntax warnings.  repr() is also affected, but fortunately only in the
[\x80-\xa0\xad] range.  I mean [\u0080-\u00a0\u00ad] :-)
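
For illustration, the changed representation can be emulated by post-processing today's ascii() output. The function ascii_u below is a hypothetical name and only a sketch of the idea, not a proposed implementation.

```python
import ast
import re

# Any backslash escape in ascii() output; group 1 is set only for \xHH.
_ESCAPE = re.compile(r"\\(?:x([0-9a-f]{2})|.)", re.DOTALL)


def ascii_u(value):
    """Like ascii(), but spell \\x80-\\xff escapes as \\u0080-\\u00ff."""
    def fix(match):
        digits = match.group(1)
        if digits is not None and int(digits, 16) >= 0x80:
            return "\\u00" + digits
        return match.group(0)  # \n, \\, \x00-\x7f, \u... stay unchanged
    return _ESCAPE.sub(fix, ascii(value))


sample = "a\x99\\x99\n\xe9\u1234"
current = ascii("\x99")      # today's form, using \x
proposed = ascii_u("\x99")   # the \u form
round_trips = ast.literal_eval(ascii_u(sample)) == sample

print(current, proposed, round_trips)
```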

Another timeline would be to change the repr first, initially hidden
under an interpreter flag or environment variable, then officially
deprecate it in the documentation, then introduce the error guarded by
from __future__ import backslashes or another flag, then make the repr
use \u by default, then add the warning and finally make it always raise
an error.

As a precedent, breaking repr() was not a dealbreaker when hash
randomization was introduced (even repr({"a", "b"}) is now
unpredictable).

This would of course be a breaking change for a lot of unit tests, and
stuff like pickle should probably keep supporting the old syntax,
delaying any such change until a new protocol comes (if it applies to
the newest one at all; I am quite sure it does not).  Such a breaking
change must be used wisely.  Other changes to octal escapes could be
slipped in, based on conclusions from the 2018 'Python octal escape
character encoding "wats"' thread[2] (I like writing "\0" and "\4"
though, just to make my opinion clear).  If going the whole hog, the
2015 'Make non-meaningful backslashes illegal in string literals'
thread[3] could be revived as well, maybe even with "\f\v" deprecated,
"\e" = "\33" introduced and such.

Please let me know what you think, what else could break, whether this
is useful anywhere apart from my use case, and what similar problems
you have run into.

Cheers,
Arusekk

[1]: https://stackoverflow.com/q/64832281/3869724
[2]: https://mail.python.org/archives/list/python-ideas@python.org/thread/ARBCIPE...
[3]: https://mail.python.org/archives/list/python-ideas@python.org/message/PJXKDJ...
