On 16 Feb 2023, at 14:57, Arusekk <arek_koz@o2.pl> wrote:
Hi all!
I was writing a tutorial on the distinction between bytes and strings and why it is important, when I saw the root cause. People coming from C, Perl, Python 2 and similar languages often mistake "\x90" for b"\x90". My idea is that Python could deprecate string literals containing any non-ASCII codepoints specified in any way other than literal Unicode characters or Unicode escapes (\u, \U, \N).
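To make the confusion concrete, here is a minimal sketch of how the two literals differ. The variable names are only for illustration.

```python
text = "\x90"   # str: a single code point, U+0090
data = b"\x90"  # bytes: a single byte, 0x90

# Both have length 1, which is part of why they are easy to confuse.
print(len(text), len(data))

# The str escape is just another spelling of the Unicode escape.
print(text == "\u0090")

# Only under Latin-1 does the code point map onto the same single byte;
# under UTF-8 it becomes two bytes.
print(text.encode("latin-1") == data)
print(text.encode("utf-8"))
```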
(Actually I found that I started having the idea already back in 2021 on StackOverflow[1]. The question is an excellent example of what I mean.)
I would not go so far as to follow JSON (disallowing \x11 and \222 escapes completely), but while writing "\x00" or "\0" is useful and widely used, "\x99" (and especially "\777"!) is probably marginal and definitely less explicit than "\u0099" (as in the Zen's "explicit is better than implicit"). Byte strings do not treat b"\u00ff" as b"\xff".
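A small sketch of the equivalences mentioned above. The helper `eval_quietly` is a hypothetical name; it exists only because current CPython versions emit a warning for "\777" and for b"\u00ff", and the sketch assumes those literals are still accepted (with a warning) by the interpreter running it.

```python
import warnings

def eval_quietly(source):
    """Evaluate a literal while hiding escape-sequence warnings (illustration only)."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return eval(source)

# In str literals, \x99 and \u0099 name the same code point.
same_codepoint = "\x99" == "\u0099"

# An octal escape in a str can go beyond one byte: 0o777 is U+01FF.
wide = eval_quietly(r'"\777"')

# In bytes literals, \u is not an escape at all; the six characters stay as they are.
not_an_escape = eval_quietly(r'b"\u00ff"')

print(same_codepoint, hex(ord(wide)), len(not_an_escape))
```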
In the first part of implementing it, Python could raise a SyntaxWarning (or should it be DeprecationWarning? BytesWarning?), suggesting that "\x99" become either "\u0099" or b"\x99", eventually promoting it to some equally helpful SyntaxError. All of it could be hidden behind a feature like from __future__ import backslashes (one nice name I can think of).
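As a rough sketch of what such a warning would flag, a linter-style check can be built on the standard `tokenize` module. This is not how the compiler would implement it, and `find_high_escapes` is a hypothetical name. It assumes only plain (non-raw, non-bytes) string tokens are of interest, and on Python 3.12+ it does not look inside f-strings, which are tokenized differently.

```python
import io
import re
import tokenize

# One backslash escape: a two-digit hex form, an octal form, or any other single character.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|[0-7]{1,3}|.)", re.DOTALL)
_PREFIX = re.compile(r"[A-Za-z]*")

def find_high_escapes(source):
    """Return (line, escape) pairs for hex/octal escapes above 0x7f in str literals."""
    findings = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        prefix = _PREFIX.match(tok.string).group().lower()
        if "b" in prefix or "r" in prefix:
            continue  # bytes and raw literals are out of scope
        for match in _ESCAPE.finditer(tok.string):
            body = match.group(1)
            if body[0] == "x" and len(body) == 3:
                value = int(body[1:], 16)
            elif body[0] in "01234567":
                value = int(body, 8)
            else:
                continue  # some other escape, e.g. \n or an escaped backslash
            if value > 0x7F:
                findings.append((tok.start[0], match.group(0)))
    return findings

# The sample starts with a newline, so the first assignment is on line 2.
sample = r'''
a = "\x90"
b = b"\x90"
c = "\0\x7f"
d = "\777"
e = "\\x90"
'''
print(find_high_escapes(sample))
```

Only lines `a` and `d` of the sample are reported; the bytes literal, the ASCII-range escapes, and the escaped backslash are left alone.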
The new regular expression for octals would be \\[01]?[0-7]{1,2} and \\x[0-7][0-9A-Fa-f] for hexadecimals, hopefully not confusing anyone, and not much more complex than the old ones.
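The two patterns above can be tried out directly with the `re` module. In this sketch, `allowed` is a hypothetical helper that checks a single escape sequence in isolation with `fullmatch`; it says nothing about how a real lexer would treat the characters following a shorter match.

```python
import re

# The proposed patterns, copied from the paragraph above.
OCTAL = re.compile(r"\\[01]?[0-7]{1,2}")
HEX = re.compile(r"\\x[0-7][0-9A-Fa-f]")

def allowed(escape):
    """True if the escape, taken on its own, fits one of the proposed patterns."""
    return bool(OCTAL.fullmatch(escape) or HEX.fullmatch(escape))

for candidate in [r"\0", r"\4", r"\177", r"\200", r"\777", r"\x00", r"\x7f", r"\x80", r"\x99"]:
    print(candidate, allowed(candidate))
```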
In the meantime, probably between introducing a warning and changing it to become an error (the most reasonable timeline I can think of now), the default ascii() representation should eventually use the \u0099 form for all such codepoints, to keep the invariant of eval(ascii(x)) == x without syntax warnings. repr() is also affected, but it is fortunately limited to the [\x80-\xa0\xad] range. I mean [\u0080-\u00a0\u00ad] :-)
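To illustrate the current behaviour and what the changed output might look like, here is a sketch. `ascii_with_u_escapes` is a hypothetical helper, not a proposed implementation. It post-processes the output of `ascii()` and assumes its argument is a `str` (a bytes repr would be rewritten incorrectly).

```python
import re

# Current behaviour: both functions use the \x form for U+0099,
# while repr() leaves printable characters such as U+00E9 alone.
print(ascii("\x99"), repr("\x99"), repr("\xe9"), ascii("\xe9"))

# A backslash followed either by xHH with HH >= 0x80 or by any single character.
# Consuming escapes one at a time keeps an escaped backslash from being misread.
_ESCAPE = re.compile(r"\\(x[89a-f][0-9a-f]|.)", re.DOTALL)

def ascii_with_u_escapes(value):
    """Like ascii() for a str, but spelling U+0080..U+00FF as \\u00HH (illustration only)."""
    def rewrite(match):
        body = match.group(1)
        if len(body) == 3:  # only the xHH alternative is three characters long
            return "\\u00" + body[1:]
        return match.group(0)
    return _ESCAPE.sub(rewrite, ascii(value))

sample = "a\x99\\x99'\xe9\n\x7f"
print(ascii_with_u_escapes(sample))
# The round-trip invariant still holds for the rewritten form.
print(eval(ascii_with_u_escapes(sample)) == sample)
```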
Another timeline would be to change the repr first, initially hidden under an interpreter flag or environment variable, then officially deprecate the old escapes in the documentation, then introduce the error guarded by from __future__ import backslashes or another flag, then make the repr use \u by default, then add the warning and finally make it always raise an error. As a precedent, breaking repr() was not a dealbreaker when introducing randomized hash seeds (even repr({"a", "b"}) is now unpredictable).
This would be of course a breaking change for a lot of unit tests, and stuff like pickle should probably support old syntax, delaying any such change until a new protocol comes (if it applies to the newest one --- quite sure it does not). Such a breaking change must be used wisely. Other changes to octal escapes could be sneaked in, based on conclusions from the 2018 'Python octal escape character encoding "wats"' thread[2] (I like writing "\0" and "\4" though, just to make my opinion clear). If going the whole hog, the 2015 'Make non-meaningful backslashes illegal in string literals' thread[3] could be revived as well, maybe even with "\f\v" deprecated, "\e" = "\33" introduced and such.
Please let me know what you think, what else could break, whether it is useful anywhere else apart from my use case, and what similar problems you have.
-1. I think you will break too much valid code. For example, '\x9b' is valid and does not match your rules: it is the ANSI CSI in its 8-bit form. In 7-bit form it is '\x1b['. Barry
Cheers, Arusekk
[1]: https://stackoverflow.com/q/64832281/3869724
[2]: https://mail.python.org/archives/list/python-ideas@python.org/thread/ARBCIPE...
[3]: https://mail.python.org/archives/list/python-ideas@python.org/message/PJXKDJ...