PEP 8: raw strings & regular expressions

Hi,

I was a little bit frustrated that Sublime Text and Atom didn't support all Python 3 features (mainly annotations & async/await syntax), so I decided to write a new highlighter: https://github.com/MagicStack/MagicPython

In the process, we had to make a decision on how to highlight raw string literals -- r''. Many existing highlighters assume that all raw strings are regexps, and highlight them as such, i.e. '\s' and '\n' will be highlighted.

I think that it might be a good idea to state the following in PEP 8:

- use r'...' strings for raw strings that describe regular expressions; these strings might be highlighted specially in some editors.
- use R'...' strings for raw strings; editors *should not* highlight any escaped characters in them.

What do you think?

Yury
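As background to the proposal, a minimal sketch showing that Python itself treats the r and R prefixes identically, so the suggested distinction would be a convention for readers and tools only. The variable names are illustrative.

```python
# Both prefixes yield an ordinary str in which backslashes are kept literally.
lower = r'\s+\n'
upper = R'\s+\n'

# The two literals are indistinguishable once parsed.
same_value = lower == upper
same_type = type(lower) is type(upper) is str

# Five characters: backslash, s, plus, backslash, n.
length = len(lower)

print(same_value, same_type, length)
```

Because the two spellings compile to the same object, only a tool that reads the source text (such as a highlighter) can tell them apart.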

Yury Selivanov <yselivanov.ml@gmail.com> writes:
Thanks for scratching your itch and releasing the result as free software!
That is evidently a simple mistake. Merely knowing that a token is a raw string does not justify the assumption that the string is a regular expression, or a filesystem entry name, or a line in a network protocol, or anything except plain text. Perhaps some more explicit context could be used to signal what the intent of a raw string is, but you'd need to find a strong consensus that programmers actually intend that. “It's a raw string” doesn't justify any of those assumptions.
I think that it might be a good idea to state the following in PEP 8:
No, I don't think the mistaken assumptions you've described should be enshrined in a style guide. Instead, the mistaken assumptions should be changed.
Ben Finney

On 2015-10-21 10:44 PM, Ben Finney wrote:
I agree 100%. But: github, gitlab, Atom, Sublime Text, and many other tools assume that raw strings (with lowercase r) are regexps. If you don't highlight them as such, people think that it's a bug. Since I wanted MagicPython to be a drop-in replacement for standard highlighters, I simply *could not* change this behavior. It's already a standard of sorts, whether we like it or not.
If we want to design some special marker for highlighters to hint what language is in the string, I'd strongly suggest that it should be before the string literal. For instance, it *won't* be possible for most highlighters to detect this:

    my_re = '''...
    ...''' # regex

Yury

On 2015-10-21 10:58 PM, Ryan Gonzalez wrote:
Yeah. I even created a PR to use MagicPython in language-python a few days ago. BTW, here's a link to github to show how raw strings are highlighted: https://github.com/python/cpython/blob/master/Lib/_pydecimal.py#L6087 Yury

On Wed, Oct 21, 2015 at 7:44 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
This isn't necessarily true, just as a matter of like... epistemology. For example, if hypothetically it turned out that 99% of raw strings are in fact regular expressions, then knowing something is a raw string would give you quite a bit of evidence that it's a regular expression -- quite possibly enough to justify treating it as such for something like code highlighting. I haven't actually gathered any data to find out how strong the association between raw strings and regexen is, but it'd be pretty easy for someone to do. (Parse a large corpus of python code to extract all raw strings, randomly subsample 100 of them, review manually to decide if each is a regex.) -n
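The survey suggested above could be sketched roughly as follows. This is a minimal sketch, not something anyone in the thread ran: the function names, the SAMPLE text, and the fixed seed are all assumptions made for illustration. It uses the standard tokenize module to pull out raw string literals and then draws a random subsample for manual review. On Python 3.12 and later, f-strings are tokenized differently, so raw f-strings would not be picked up by this approach.

```python
import io
import random
import tokenize


def raw_string_literals(source):
    """Return the raw string literals (r'', R'', rb'', ...) in source text."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        # The prefix is everything before the first quote character.
        quote_at = min(i for i, ch in enumerate(tok.string) if ch in "'\"")
        prefix = tok.string[:quote_at].lower()
        if "r" in prefix:
            found.append(tok.string)
    return found


def sample_for_review(literals, k=100, seed=0):
    """Pick up to k literals at random for manual classification."""
    rng = random.Random(seed)
    return rng.sample(literals, min(k, len(literals)))


# Hypothetical stand-in for a real corpus of Python files.
SAMPLE = r'''
import re
pattern = re.compile(r"\d+")
path = R'C:\temp\new.txt'
plain = "not raw"
data = rb'\x00'
'''

found = raw_string_literals(SAMPLE)
for literal in sample_for_review(found, k=2):
    print(literal)
```

Deciding whether each sampled literal is a regular expression would still be a manual step, as the message says.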

On 2015-10-21 11:31 PM, Nathaniel Smith wrote:
I haven't done this in the scientific way you suggest, but I did glance over the stdlib when I was testing MP. Most of the raw strings I saw were either docstrings or regexps. Docstrings are possible to detect by highlighters, so MagicPython does not highlight r'' as regexp if it's a docstring. Yury
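To illustrate the point that docstrings can be recognized by position, here is a minimal sketch using the standard ast module. It is not MagicPython's implementation, which the thread does not describe; the function name and SAMPLE text are assumptions for illustration. A docstring is taken to be a string constant that is the first statement of a module, class, or function body.

```python
import ast


def docstring_linenos(source):
    """Return the starting line numbers of docstring literals in source."""
    lines = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.Module, ast.ClassDef,
                                 ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        body = node.body
        # A docstring is a bare string expression in first position.
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            lines.add(body[0].lineno)
    return sorted(lines)


# Hypothetical source: one raw docstring and one raw regex literal.
SAMPLE = r'''
def match_digits(text):
    r"""Return True if text matches \d+."""
    import re
    return re.fullmatch(r"\d+", text) is not None
'''

print(docstring_linenos(SAMPLE))
```

Here only the literal on the third line of SAMPLE is reported, so a tool could skip regex highlighting for it while still treating the later r"\d+" as a candidate.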

Nathaniel Smith <njs@pobox.com> writes:
Well, yes, if you like. Epistemically, a syntax highlighter cannot know that a raw string is, merely because it's a raw string, definitely a regular expression pattern. We have a definition of the language which allows syntax highlighters to know with certainty what is and is not a particular element of the language. So if the highlighter shows a sequence of characters as being what the Python language definition says it is, then it will not be wrong in any case. We do not have a definition which allows syntax highlighters to decide that a raw string is or is not a regular expression, merely because it's a raw string. So if a highlighter shows a sequence of characters in a Python program as being a regular expression, it will be wrong for some cases.
Presenting the code highlighted to show particular semantics is a binary state: it either is shown as (for example) a regular expression, or it is not. The reader only gets to see what the highlighter decided, not how certain the epistemic decision was. How is the person viewing it to know whether the highlighter is wrong about the intention of the code in any particular case, or if the highlighter is right and the code doesn't match the author's intention? If the reader has to second-guess the highlighter (am I wrong here, or is the highlighter wrong, or both?) every time it doesn't match expectations, that's a poor syntax highlighter which should never have made such a binary decision on uncertain data.
Ben Finney

On 22 October 2015 at 05:31, Nathaniel Smith <njs@pobox.com> wrote:
I'd expect Windows filesystem paths to win handily if we could scan all the Python code in the world, but they wouldn't show up in a scan of POSIX-specific open source code. Cheers, Nick -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 22.10.2015 04:44, Ben Finney wrote:
Agreed. Highlighters should follow language definitions, not the other way around ;-) Yury: Perhaps you could make the behavior optional in your highlighter and allow people to turn off regular-expression highlighting if they find they don't like it. -- Marc-Andre Lemburg

On October 21, 2015 9:24:02 PM CDT, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Now see if you can get the Linguist guys to use it! :) Every time I look at the Fbuild source online, my eyes sting due to the lack of annotation support in language-python, which screws up highlighting for a bit: https://github.com/felix-lang/fbuild/blob/master/lib/fbuild/builders/c/gcc/_... - Keywords aren't highlighted for several lines. Or: https://github.com/felix-lang/fbuild/blob/master/lib/fbuild/builders/bison.p... Where almost (but not quite) everything is un-highlighted! :O

On 10/21/2015 10:24 PM, Yury Selivanov wrote:
What 3rd-party editors do is their business.
I think it a bad idea. For beginners on Windows, r'windows\path\file.py' might be more common than r're'. I have never seen R used. If you wanted to promote the use of the currently rare R for REs, and have editors specially mark raw literals with this special prefix, I would not mind. -- Terry Jan Reedy
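To illustrate why raw strings are attractive for Windows paths, here is a minimal sketch. The path C:\temp\new.txt is a made-up example chosen because it contains two valid escape sequences; it does not come from the thread.

```python
# Without the r prefix, \t and \n are processed as a tab and a newline.
escaped = 'C:\temp\new.txt'

# With the r prefix, both backslashes are kept literally.
raw = r'C:\temp\new.txt'

# The non-raw version silently lost both backslashes.
has_tab_and_newline = '\t' in escaped and '\n' in escaped
backslashes_in_raw = raw.count('\\')

print(repr(escaped))
print(repr(raw))
```

A highlighter that treated the raw literal as a regular expression would mark \t and \n as special, even though here they are just path separators followed by letters.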

On Oct 22, 2015, at 00:44, Terry Reedy <tjreedy@udel.edu> wrote:
It's also worth noting that an awful lot of code that uses Windows pathnames is either beginner code, local scripts, or closed-source commercial code, which means a typical code search is probably going to vastly underrepresent how common they are in raw strings. (Of course that same fact means it may be perfectly reasonable for GitHub to assume raw strings are regexps rather than Windows pathnames, even if it isn't reasonable for Python itself, or general-purpose tools like IDLE…)
I have never seen R used.
If you wanted to promote the use of the currently rare R for REs, and have editors specially mark raw literals with this special prefix, I would not mind.
That doesn't sound as bad. But I still don't like it. Where else does Python provide two equivalent ways to do something, specifically to support external semantic connotations? It's like having <> and != both mean the same thing to support people coming up with some language-external difference between the spellings. (Yes, I realize there are a few cases like this—e.g., someone could use the fact that int and 'int' annotate the same type to give them different connotations—but those are accidental effects of some other language feature; PEP 8 certainly isn't going to suggest using 'int' to mean one thing and int another.) If we really want there to be a difference, we should have a regex literal syntax—maybe an x or s prefix or something—in place of re.compile(r'…').

On Thu, Oct 22, 2015 at 9:00 PM, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
But I still don't like it. Where else does Python provide two equivalent ways to do something, specifically to support external semantic connotations? It's like having <> and != both mean the same thing to support people coming up with some language-external difference between the spellings.
I don't know, but since PEP 8 recommends using """ for docstrings and not ''', it would be entirely possible for someone to assign special meaning to the use of ''' for a docstring. ChrisA

participants (12): Andrew Barnert, Ben Finney, Chris Angelico, M.-A. Lemburg, Nathaniel Smith, Nick Coghlan, Rob Cliffe, Ryan Gonzalez, Serhiy Storchaka, Sven R. Kunze, Terry Reedy, Yury Selivanov