Deprecate misleading escapes in strings
Hi all!

I was writing a tutorial on the distinction between bytes and strings, and why it matters, when I noticed what I think is the root cause of the confusion. People coming from C, Perl, Python 2 and similar languages often mistake "\x90" for b"\x90". My idea is that Python could deprecate string literals that specify any non-ASCII code point by any means other than a literal Unicode character or a Unicode escape (\u, \U, \N).

(Actually, I found that I first had this idea back in 2021 on StackOverflow[1]. The question is an excellent example of what I mean.)

I would not go as far as JSON (disallowing \x11 and \222 escapes completely). While writing "\x00" or "\0" is useful and widely done, "\x99" (and especially "\777"!) is probably marginal and definitely less explicit than "\u0099" (in the spirit of the Zen's "Explicit is better than implicit"). Byte strings do not treat b"\u00ff" as b"\xff".

As a first step, Python could emit a SyntaxWarning (or should it be DeprecationWarning? BytesWarning?) suggesting that "\x99" become either "\u0099" or b"\x99", and eventually promote it to an equally helpful SyntaxError. All of this could be hidden behind a feature like from __future__ import backslashes (one nice name I can think of).

The new regular expression for octal escapes would be \\[01]?[0-7]{1,2}, and for hexadecimal escapes \\x[0-7][0-9A-Fa-f], hopefully not confusing anyone, and not much more complex than the old ones.

In the meantime, probably between introducing the warning and turning it into an error (the most reasonable timeline I can think of now), the default ascii() representation should switch to the \u0099 form for all such code points, to keep the invariant eval(ascii(x)) == x free of syntax warnings. repr() is also affected, but fortunately only in the [\x80-\xa0\xad] range. I mean [\u0080-\u00a0\u00ad] :-)

Another timeline would be to change the repr first, initially hidden under an interpreter flag or environment variable, then officially deprecate the escapes in the documentation, then introduce the error guarded by from __future__ import backslashes or another flag, then make the repr use \u by default, then add the warning, and finally make it always raise an error. As a precedent, breaking repr() was not a dealbreaker when hash randomization was introduced (even repr({"a", "b"}) is now unpredictable).

This would of course be a breaking change for a lot of unit tests, and things like pickle should probably keep supporting the old syntax, delaying any such change until a new protocol arrives (if it applies to the newest one at all; I am quite sure it does not). The opportunity for such a breaking change must be used wisely. Other changes to octal escapes could be sneaked in, based on conclusions from the 2018 'Python octal escape character encoding "wats"' thread[2] (I like writing "\0" and "\4" though, just to make my opinion clear). If going the whole hog, the 2015 'Make non-meaningful backslashes illegal in string literals' thread[3] could be revived as well, maybe even with "\f\v" deprecated, "\e" = "\33" introduced, and such.

Please let me know what you think, what else could break, whether it is useful anywhere apart from my use case, and what similar problems you have.

Cheers,
Arusekk

[1]: https://stackoverflow.com/q/64832281/3869724
[2]: https://mail.python.org/archives/list/python-ideas@python.org/thread/ARBCIPE...
[3]: https://mail.python.org/archives/list/python-ideas@python.org/message/PJXKDJ...
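A minimal sketch of the points in the message above, assuming CPython 3. It shows why "\x90" and b"\x90" are different objects, applies the two regular expressions quoted from the proposal to a few escapes, and checks the repr()/ascii() output mentioned for the U+0080-U+00A0 and U+00AD range. The helper name escape_allowed is hypothetical and only serves the illustration; nothing here is an existing Python feature.

```python
import re

text = "\x90"   # str: a single code point, U+0090
data = b"\x90"  # bytes: a single byte, 0x90

# The str escape names a code point, not a byte.
assert text == "\u0090" and len(text) == 1
# Encoding to UTF-8 produces two bytes, which is the usual surprise.
assert text.encode("utf-8") == b"\xc2\x90"
# Only Latin-1 maps U+0000-U+00FF one-to-one onto byte values.
assert text.encode("latin-1") == data

# Patterns copied from the proposal: escapes that stay within ASCII.
OCTAL_OK = re.compile(r"\\[01]?[0-7]{1,2}")
HEX_OK = re.compile(r"\\x[0-7][0-9A-Fa-f]")


def escape_allowed(escape: str) -> bool:
    """Hypothetical check: would this escape remain legal in a str literal?"""
    return bool(OCTAL_OK.fullmatch(escape) or HEX_OK.fullmatch(escape))


# Raw strings keep the backslashes so the escapes are inspected as text.
for esc in (r"\0", r"\4", r"\177", r"\x00", r"\x7f"):
    assert escape_allowed(esc)
for esc in (r"\200", r"\777", r"\x80", r"\x99"):
    assert not escape_allowed(esc)

# Current behaviour the proposal would have to change:
# ascii() uses \x for every non-ASCII code point below U+0100,
# repr() only for the non-printable ones in that range.
assert ascii("\x99") == r"'\x99'"
assert repr("\xa0") == r"'\xa0'"
assert repr("\xad") == r"'\xad'"
assert repr("\xa1") == "'¡'"
```

Under the proposal, the last three \x forms would be printed as \u0099, \u00a0 and \u00ad instead, so that evaluating the output never triggers the new warning.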
You should bring this up on https://discuss.python.org/c/ideas/6 , which is where ideas are discussed these days. This mailing list should be retired. I’ll mention that elsewhere.

-- Eric
-1. I think you will break too much valid code.

This is valid and does not match your rules: '\x9b', which is the ANSI CSI in its 8-bit form. In its 7-bit form it is '\x1b['.

Barry
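A short sketch of the example above, assuming CPython 3. It compares the 8-bit CSI (the single C1 control character U+009B) with the 7-bit form (ESC followed by "["). The helper bold and the constant names are hypothetical, introduced only for illustration.

```python
import unicodedata

CSI_8BIT = "\x9b"   # U+009B, a single C1 control character
CSI_7BIT = "\x1b["  # ESC (U+001B) followed by "["

# Both spellings of the 8-bit form denote the same str today;
# the proposal would accept only the \u009b one.
assert CSI_8BIT == "\u009b"
assert unicodedata.category(CSI_8BIT) == "Cc"

# The 7-bit form uses only ASCII escapes, so it would be unaffected.
assert CSI_7BIT == "\u001b["


def bold(text: str, eight_bit: bool = False) -> str:
    """Hypothetical helper: wrap text in SGR bold and reset sequences."""
    csi = CSI_8BIT if eight_bit else CSI_7BIT
    return f"{csi}1m{text}{csi}0m"


assert bold("hi") == "\x1b[1mhi\x1b[0m"
assert bold("hi", eight_bit=True) == "\u009b1mhi\u009b0m"
```

Code that writes "\x9b" in a str literal is the kind that would start warning under the proposal, even though it is valid today.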
Wow! That would break SO MUCH of the code I've written! E.g.:

translate = {"el": "ἐπιστήμη", "en": "Knowledge", "zh": "知识"}
On 16.02.2023 at 17:55, David Mertz, Ph.D. wrote:
Wow! That would break SO MUCH of the code I've written! E.g.:
translate = {"el": "ἐπιστήμη", "en": "Knowledge", "zh": "知识"}
You did not use any code point in the U+0080-U+00FF range here. Are you sure the primary suggestion would break such code?

I only meant to deprecate "\xNN" in favor of "\u00NN" in the original idea, because it is too easily confused with b"\xNN". No changes to literal Unicode characters are intended at all. Sorry for confusing you with my verbosity, I guess.

Arusekk
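A quick check of the claim above, assuming CPython 3. The dictionary is copied from the quoted message; the helper name latin1_supplement_chars is hypothetical. It confirms that none of the values contain a character in the U+0080-U+00FF range, which is the only range the \xNN and octal escapes in question can reach beyond ASCII.

```python
translate = {"el": "ἐπιστήμη", "en": "Knowledge", "zh": "知识"}


def latin1_supplement_chars(s: str) -> list[str]:
    """Hypothetical helper: characters of s in the U+0080-U+00FF range."""
    return [ch for ch in s if 0x80 <= ord(ch) <= 0xFF]


# No value touches the range that "\x80".."\xff" can express.
assert all(not latin1_supplement_chars(v) for v in translate.values())

# A string that does contain such a character, for contrast.
assert latin1_supplement_chars("caf\u00e9") == ["é"]
```

The literal also contains no backslash escapes at all, so the proposed warning, which concerns only how escapes are spelled, would not apply to it.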
On Fri, 17 Feb 2023 at 06:11, Arusekk <arek_koz@o2.pl> wrote:
I only meant deprecate "\xNN" in favor of "\u00NN" in the original idea, because it is too confusing against b"\xNN".
Bytes literals are allowed to contain ASCII characters because bytestrings often do contain textual portions. This could have been changed in Python 3.0, but it wasn't, because it is *useful* to have text and byte strings work similarly. The confusion you're describing is just as strong as:

    b"Length: %d" % count
    # versus
    "Length: %d" % count

which was specifically *added* to byte strings because, again, it is incredibly useful.

What would actually be gained by breaking text strings in this way, other than a warm fuzzy feeling that even the first 256 codepoints are still represented by four-digit numbers? It's not guaranteeing uniformity of Unicode escapes (since the first 65536 codepoints still get a shorthand that can't be used for the others), it's not actually distinguishing them from byte strings (they have a lot of the same methods and behaviours), and you're breaking a huge amount of perfectly reasonable code.

Breaking backward compatibility is a **big deal**. It needs a lot more justification than you've provided.

ChrisA
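A small sketch of the behaviours referred to in the message above, assuming Python 3.5 or later (where %-formatting is available on bytes). The variable name count comes from the message; its value is an arbitrary choice for illustration.

```python
count = 42  # arbitrary illustrative value

# %-formatting works the same way on bytes and on str.
assert b"Length: %d" % count == b"Length: 42"
assert "Length: %d" % count == "Length: 42"

# Many methods are shared between the two types as well.
assert b"abc".upper() == b"ABC"
assert "abc".upper() == "ABC"

# Escapes for the same code point: \x is shorthand for the first 256,
# \u for the first 65536, and only \U covers every code point.
assert "\xe9" == "\u00e9" == "\U000000e9"
assert "\U0001F600" == chr(0x1F600)
assert len("\U0001F600") == 1
```

The last lines illustrate the uniformity point: removing the two-digit shorthand would still leave \u as a four-digit shorthand that cannot express code points above U+FFFF.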
participants (6)
- Arusekk
- Barry
- Ben Rudiak-Gould
- Chris Angelico
- David Mertz, Ph.D.
- Eric V. Smith