Raw string literals and trailing backslash
Currently a raw literal cannot end in a single backslash (e.g. r"C:\User\"). There are reasons for this, but it is an old gotcha, and there are many closed issues about it. The question is even included in the FAQ.
The most common workarounds are:
r"C:\User" "\\"
and
r"C:\User\ "[:-1]
I tried an experiment. It was easy to make the parser allow a trailing backslash character. It was more difficult to change the Python implementation in the tokenizer module. But this change breaks existing code in more places than I expected: 14 Python files in the stdlib (not counting tokenizer.py) would need to be fixed. In all cases the affected literal is a regular expression.
A few examples:
1. r"([\"\\])"
If only one type of quote is used in a string, we can just use the other kind of quote to create the string literal and remove the escaping.
r'(["\\])'
2. r'(\'[^\']*\'|"[^"]*"|...'
If different types of quotes are used in different parts of a string, we can use implicit concatenation of string literals created with different quotes (in any case, such a regular expression is long and should be split over several lines at semantic boundaries).
r"('[^']*'|" r'"[^"]*"|' r'...'
3. r"([^.'\"\\#]\b|^)"
You can also use triple quotes if the string contains both types of quotes together.
r"""([^.'"\\#]\b|^)"""
4. In rare cases a multiline raw string literal can contain both `'''` and `"""`. In this case you can use implicit concatenation of string literals created with different triple quotes.
See https://github.com/python/cpython/pull/15217 .
I do not think we are ready for such a breaking change. It will break more code than forbidding unrecognized escape sequences, and the required fixes are less trivial.
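To make the gotcha concrete, here is a minimal sketch of the current behaviour (not of the experimental parser change). It assumes an unmodified CPython 3, where a raw literal ending in a single backslash is rejected at compile time, and it checks that the two workarounds quoted in the message build the same string. The variable names are illustrative only.

```python
# The text  r"C:\User\"  written out as source code to be compiled.
source = 'r"C:\\User\\"'

try:
    compile(source, "<example>", "eval")
    compiles = True
except SyntaxError:
    # The final \" is read as an escaped quote, so the literal never ends.
    compiles = False

# Workaround 1: implicit concatenation with an ordinary "\\" literal.
concatenated = r"C:\User" "\\"

# Workaround 2: add a padding character after the backslash and slice it off.
sliced = r"C:\User\ "[:-1]

print(compiles)                # whether the trailing-backslash literal compiled
print(concatenated == sliced)  # whether both workarounds agree
print(concatenated)            # the resulting path
```

Both workarounds rely only on existing syntax, which is why they keep working regardless of whether the tokenizer is ever changed.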
On 8/12/2019 12:08 AM, Serhiy Storchaka wrote:
Currently a raw literal cannot end in a single backslash (e.g. r"C:\User\"). There are reasons for this, but it is an old gotcha, and there are many closed issues about it. The question is even included in the FAQ.
Hmm. I didn't find it in the documentation, and after searching the FAQ for it in several ways, I wasn't able to find it there either.
The most common workarounds are:
r"C:\User" "\\"
and
r"C:\User\ "[:-1]
I tried an experiment. It was easy to make the parser allow a trailing backslash character. It was more difficult to change the Python implementation in the tokenizer module. But this change breaks existing code in more places than I expected: 14 Python files in the stdlib (not counting tokenizer.py) would need to be fixed. In all cases the affected literal is a regular expression.
A few examples:
1. r"([\"\\])"
If only one type of quote is used in a string, we can just use the other kind of quote to create the string literal and remove the escaping.
r'(["\\])'
2. r'(\'[^\']*\'|"[^"]*"|...'
If different types of quotes are used in different parts of a string, we can use implicit concatenation of string literals created with different quotes (in any case, such a regular expression is long and should be split over several lines at semantic boundaries).
r"('[^']*'|" r'"[^"]*"|' r'...'
3. r"([^.'\"\\#]\b|^)"
You can also use triple quotes if the string contains both types of quotes together.
r"""([^.'"\\#]\b|^)"""
4. In rare cases a multiline raw string literal can contain both `'''` and `"""`. In this case you can use implicit concatenation of string literals created with different triple quotes.
See https://github.com/python/cpython/pull/15217 .
I do not think we are ready for such a breaking change. It will break more code than forbidding unrecognized escape sequences, and the required fixes are less trivial.
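The rewrites in points 1 to 3 change the text of the pattern (the backslash before the quote disappears) but should not change what the regular expression matches. The sketch below checks that on one sample string. Note the assumptions: the pattern in point 2 is elided with "..." in the message, so `old_2` and `new_2` are a shortened, closed variant made up for this illustration, and `sample` and `same_matches` are hypothetical names.

```python
import re

# Point 1: only double quotes inside, so switch to single-quote delimiters.
old_1 = r"([\"\\])"
new_1 = r'(["\\])'

# Point 2 (shortened, hypothetical variant of the elided pattern):
# each part is delimited with the quote kind it does not contain.
old_2 = r'(\'[^\']*\'|"[^"]*")'
new_2 = (r"('[^']*'|"
         r'"[^"]*")')

# Point 3: both quote kinds together, so use a triple-quoted raw literal.
old_3 = r"([^.'\"\\#]\b|^)"
new_3 = r"""([^.'"\\#]\b|^)"""

# Sample text containing both quote kinds, a backslash, a dot and a hash.
sample = 'say "hi" or \'bye\' \\ x.y #z'


def same_matches(old, new, text):
    """Return True if both patterns find the same matches in text."""
    return re.findall(old, text) == re.findall(new, text)


pairs = [(old_1, new_1), (old_2, new_2), (old_3, new_3)]
results = [same_matches(old, new, sample) for old, new in pairs]
print(results)
```

One sample string is not a proof of equivalence. The underlying reason is that the `re` module treats an escaped quote the same as a bare quote, so dropping the backslash leaves the meaning of the pattern unchanged.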
Thanks for your investigation, Serhiy. Point 3 seems like the easiest way to convert most regular expressions containing \" or \' from r"..." form to v"""...""", without disturbing the internal gibberish in the regular expression, and without needing significant analysis.
Regarding point 4, if it is a string literal used as a regexp, internal triple quotes can be recoded as "{3} and '{3} . But whether or not it is used as a regexp, I fail to find a syntax that permits the creation of a multiline raw string containing both "'''" and '"""' without using implicit concatenation. Since implicit concatenation must already be in use for that case, converting from raw string to verbatim string is straightforward.
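Both halves of the point 4 remark can be sketched with today's syntax. In the regexp case a run of three quotes is written with a repetition count, so no triple quote appears in the pattern itself. In the general case a multiline raw string holding both kinds of triple quote is built by implicit concatenation. The names below are illustrative, and the proposed v"""...""" prefix is not used because it does not exist in current Python.

```python
import re

# Regexp case: a quote followed by {3} matches exactly three quotes,
# so the pattern text never contains a triple quote.
double_triple = re.compile(r'"{3}')
single_triple = re.compile(r"'{3}")

# General case: each raw literal is delimited by the triple quote
# that it does not contain, and the two are concatenated implicitly.
both = (r'''first line has """
''' r"""second line has '''
""")

print(double_triple.search(both) is not None)
print(single_triple.search(both) is not None)
print(both)
```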
12.08.19 22:41, Glenn Linderman wrote:
On 8/12/2019 12:08 AM, Serhiy Storchaka wrote:
Currently a raw literal cannot end in a single backslash (e.g. r"C:\User\"). There are reasons for this, but it is an old gotcha, and there are many closed issues about it. The question is even included in the FAQ.
Hmm. I didn't find it in the documentation, and after searching the FAQ for it in several ways, I wasn't able to find it there either.
https://docs.python.org/3/faq/design.html#why-can-t-raw-strings-r-strings-en...
Thanks for your investigation, Serhiy. Point 3 seems like the easiest way to convert most regular expressions containing \" or \' from r"..." form to v"""...""", without disturbing the internal gibberish in the regular expression, and without needing significant analysis.
No new prefix is needed, since a single trailing backslash is never a problem in a regular expression (it is illegal RE syntax).
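The claim that a pattern cannot legitimately end in a lone backslash can be checked against the standard `re` module. This is a minimal sketch, and `pattern_text` is an illustrative name.

```python
import re

# Four characters: "abc" followed by one trailing backslash.
pattern_text = "abc" "\\"

try:
    re.compile(pattern_text)
    accepted = True
except re.error:
    # The backslash has nothing to escape, so the pattern is rejected.
    accepted = False

print(accepted)
```

So a valid regular expression written as a raw literal never needs to end in a single backslash, which is why the regexp cases in the stdlib can be fixed by changing only the quoting.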
Regarding point 4, if it is a string literal used as a regexp, internal triple quotes can be recoded as "{3} and '{3} .
Good point! This is yet another option.
On 8/12/2019 10:21 PM, Serhiy Storchaka wrote:
12.08.19 22:41, Glenn Linderman wrote:
On 8/12/2019 12:08 AM, Serhiy Storchaka wrote:
Currently a raw literal cannot end in a single backslash (e.g. r"C:\User\"). There are reasons for this, but it is an old gotcha, and there are many closed issues about it. The question is even included in the FAQ.
Hmm. I didn't find it in the documentation, and after searching the FAQ for it in several ways, I wasn't able to find it there either.
https://docs.python.org/3/faq/design.html#why-can-t-raw-strings-r-strings-en...
Thanks. After my Google searches failed, I looked at the Python FAQ TOC, and the sections that seemed most promising seemed to be "General" and "Programming" and "Python on Windows". I never thought to look under "Design and History". "Programming" actually had a section on strings, and it wasn't there... which reduced my enthusiasm for reading the whole thing, and since it is in 8 sections, it was cumbersome to do a global search in the browser. It looks like the FAQ is part of the standard documentation, but it seems like it would be more useful if there were cross-links between the documentation and the related FAQs.
Thanks for your investigation, Serhiy. Point 3 seems like the easiest way to convert most regular expressions containing \" or \' from r"..." form to v"""...""", without disturbing the internal gibberish in the regular expression, and without needing significant analysis.
No new prefix is needed, since a single trailing backslash is never a problem in a regular expression (it is illegal RE syntax).
I'd be interested in your comments on my future import idea <https://mail.python.org/archives/list/python-dev@python.org/message/XJNS45JG...> either here or privately. After 30 years of Python, it seems that there are quite a few warts in the string syntax, and a fresh start might be appropriate, as well as simpler to document, learn, and teach, and future import would allow a gradual, opt-in migration. It may be a long time, if ever, before the current syntax warts could be removed and the future import eliminated, but from the sounds of things, it might also be a long time, if ever, before there can be agreement on adding new escapes or giving errors for bad ones in the present syntax: making any changes without introducing a new prefix is a breaking, incompatible change.
Regarding point 4, if it is a string literal used as a regexp, internal triple quotes can be recoded as "{3} and '{3} .
Good point! This is yet another option.
Participants (2): Glenn Linderman, Serhiy Storchaka