Re: [Python-ideas] PEP 8: raw strings & regular expressions
On 10/22/15 6:56 PM, Yury Selivanov wrote:
In principle, there is no reason why *both* of these groups of users can't use one tool and be happy. I propose to establish a convention in PEP 8, explaining that, while both literals are semantically equivalent,
- r'..' strings *should* be used for regexps,
- R'..' strings *should* be used for unstyled raw strings,
and tools *should* treat them as such.
All of this is merely about codifying the current status quo.
But you are not codifying the status quo. The distinction you are proposing is one that you have invented. I have never used R"" strings.
I think the best solution to the problem is to improve the highlighters, and luckily you have written one! To me, it is clear which of these strings is the regex:
r"\d+" r"\dir"
If the highlighters tried some heuristics, they could do a better job "being helpful" by making better guesses about the meaning of programs. I don't mind when highlighters make wrong guesses, as long as they don't ruin the entire rest of the file. But better guesses will be better. :)
--Ned.
On 10/23/2015 14:40, Ned Batchelder wrote:
On 10/22/15 6:56 PM, Yury Selivanov wrote:
In principle, there is no reason why *both* of these groups of users can't use one tool and be happy. I propose to establish a convention in PEP 8, explaining that, while both literals are semantically equivalent,
- r'..' strings *should* be used for regexps,
- R'..' strings *should* be used for unstyled raw strings,
and tools *should* treat them as such.
All of this is merely about codifying the current status quo.
But you are not codifying the status quo. The distinction you are proposing is one that you have invented. I have never used R"" strings.
I think the best solution to the problem is to improve the highlighters, and luckily you have written one! To me, it is clear which of these strings is the regex:
r"\d+" r"\dir"
If the highlighters tried some heuristics, they could do a better job "being helpful" by making better guesses about the meaning of programs. I don't mind when highlighters make wrong guesses, as long as they don't ruin the entire rest of the file. But better guesses will be better. :)
--Ned.
It should be noted that most regexes are also valid paths on NTFS. Is r'\dir[a-zA-Z0-9]\\' a path or a regex?
On 10/26/15 3:23 AM, Alexander Walters wrote:
On 10/23/2015 14:40, Ned Batchelder wrote:
On 10/22/15 6:56 PM, Yury Selivanov wrote:
In principle, there is no reason why *both* of these groups of users can't use one tool and be happy. I propose to establish a convention in PEP 8, explaining that, while both literals are semantically equivalent,
- r'..' strings *should* be used for regexps,
- R'..' strings *should* be used for unstyled raw strings,
and tools *should* treat them as such.
All of this is merely about codifying the current status quo.
But you are not codifying the status quo. The distinction you are proposing is one that you have invented. I have never used R"" strings.
I think the best solution to the problem is to improve the highlighters, and luckily you have written one! To me, it is clear which of these strings is the regex:
r"\d+" r"\dir"
If the highlighters tried some heuristics, they could do a better job "being helpful" by making better guesses about the meaning of programs. I don't mind when highlighters make wrong guesses, as long as they don't ruin the entire rest of the file. But better guesses will be better. :)
--Ned.
It should be noted that most regexes are also valid paths on NTFS. Is r'\dir[a-zA-Z0-9]\\' a path or a regex?
I understand developers' penchant for getting everything precisely right and accounting for the darkest of corners and the farthest reaches of obscure edge cases. But I'm talking about making a reasonable guess. If the string contains square brackets, especially paired brackets with hyphens inside, it's probably a regex.
--Ned.
On Oct 26, 2015, at 04:33, Ned Batchelder wrote:
On 10/26/15 3:23 AM, Alexander Walters wrote:
On 10/23/2015 14:40, Ned Batchelder wrote:
On 10/22/15 6:56 PM, Yury Selivanov wrote: In principle, there is no reason why *both* of these groups of users can't use one tool and be happy. I propose to establish a convention in PEP 8, explaining that, while both literals are semantically equivalent,
- r'..' strings *should* be used for regexps,
- R'..' strings *should* be used for unstyled raw strings,
and tools *should* treat them as such.
All of this is merely about codifying the current status quo. But you are not codifying the status quo. The distinction you are proposing is one that you have invented. I have never used R"" strings.
I think the best solution to the problem is to improve the highlighters, and luckily you have written one! To me, it is clear which of these strings is the regex:
r"\d+" r"\dir"
If the highlighters tried some heuristics, they could do a better job "being helpful" by making better guesses about the meaning of programs. I don't mind when highlighters make wrong guesses, as long as they don't ruin the entire rest of the file. But better guesses will be better. :)
--Ned.
It should be noted that most regexes are also valid paths on NTFS. Is r'\dir[a-zA-Z0-9]\\' a path or a regex?
I understand developers' penchant for getting everything precisely right and accounting for the darkest of corners and the farthest reaches of obscure edge cases. But I'm talking about making a reasonable guess. If the string contains square brackets, especially paired brackets with hyphens inside, it's probably a regex.
From working on music tagging software, I can tell you that an awful lot of users have mp3s with square brackets, hyphens, and other such things in their filenames, so if your software makes any assumptions about what filenames look like, their libraries will break your software.
And to verify that this isn't some weird artifact of the way people used to name files on piracy networks back when people traded individual songs, I went to The Pirate Bay and checked the most popular current download in any category, and its first file is named:
[ www.CpasBien.pw ] Tomorrowland.2015.TRUEFRENCH.BDRip.VxiD-EXTREME.avi
So, I don't think you can assume that paired square brackets or hyphens mean something is not a Windows pathname.
Of course with a wide enough corpus of filenames people have to deal with, you could come up with a better heuristic. (Not many regexes have character classes that are dotted domain names, or match a standard language code followed by "-sub", or most of the other examples I see from a quick scan.) But just guessing based on what you guess filenames are like without looking around is not going to get you that far.
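To make the disagreement concrete, here is a minimal sketch of the two guesses discussed above. The function names and the third sample string are made up for illustration; nothing here is taken from an actual highlighter. It suggests that "has paired brackets" fires on the quoted download name, while "has a hyphen inside paired brackets" happens not to fire on it but would on other plausible file names.

```python
import re

# Hypothetical heuristics, for illustration only.
PAIRED_BRACKETS = re.compile(r"\[[^\]]*\]")
HYPHEN_IN_BRACKETS = re.compile(r"\[[^\]]*-[^\]]*\]")

def has_paired_brackets(s):
    """True if s contains a '[' followed later by a ']'."""
    return PAIRED_BRACKETS.search(s) is not None

def has_hyphen_in_brackets(s):
    """True if s contains a hyphen between a '[' and the next ']'."""
    return HYPHEN_IN_BRACKETS.search(s) is not None

regex_like = r"\dir[a-zA-Z0-9]\\"
filename = "[ www.CpasBien.pw ] Tomorrowland.2015.TRUEFRENCH.BDRip.VxiD-EXTREME.avi"
made_up = "[2015-10] mix.mp3"  # invented file name, not from the thread

for s in (regex_like, filename, made_up):
    print(repr(s), has_paired_brackets(s), has_hyphen_in_brackets(s))
```

Either guess can be fooled by ordinary file names, which is the point of the objection; a real tool would presumably need a corpus to tune against.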
On Oct 26, 2015, at 04:55, Andrew Barnert via Python-ideas wrote:
On Oct 26, 2015, at 04:33, Ned Batchelder wrote:
On 10/26/15 3:23 AM, Alexander Walters wrote:
On 10/23/2015 14:40, Ned Batchelder wrote:
On 10/22/15 6:56 PM, Yury Selivanov wrote: In principle, there is no reason why *both* of these groups of users can't use one tool and be happy. I propose to establish a convention in PEP 8, explaining that, while both literals are semantically equivalent,
- r'..' strings *should* be used for regexps,
- R'..' strings *should* be used for unstyled raw strings,
and tools *should* treat them as such.
All of this is merely about codifying the current status quo. But you are not codifying the status quo. The distinction you are proposing is one that you have invented. I have never used R"" strings.
I think the best solution to the problem is to improve the highlighters, and luckily you have written one! To me, it is clear which of these strings is the regex:
r"\d+" r"\dir"
If the highlighters tried some heuristics, they could do a better job "being helpful" by making better guesses about the meaning of programs. I don't mind when highlighters make wrong guesses, as long as they don't ruin the entire rest of the file. But better guesses will be better. :)
--Ned.
It should be noted that most regexes are also valid paths on NTFS. Is r'\dir[a-zA-Z0-9]\\' a path or a regex?
I understand developers' penchant for getting everything precisely right and accounting for the darkest of corners and the farthest reaches of obscure edge cases. But I'm talking about making a reasonable guess. If the string contains square brackets, especially paired brackets with hyphens inside, it's probably a regex.
From working on music tagging software, I can tell you that an awful lot of users have mp3s with square brackets, hyphens, and other such things in their filenames, so if your software makes any assumptions about what filenames look like, their libraries will break your software.
And to verify that this isn't some weird artifact of the way people used to name files on piracy networks back when people traded individual songs, I went to The Pirate Bay and checked the most popular current download in any category, and its first file is named:
[ www.CpasBien.pw ] Tomorrowland.2015.TRUEFRENCH.BDRip.VxiD-EXTREME.avi
So, I don't think you can assume that paired square brackets or hyphens mean something is not a Windows pathname.
Of course with a wide enough corpus of filenames people have to deal with, you could come up with a better heuristic. (Not many regexes have character classes that are dotted domain names, or match a standard language code followed by "-sub", or most of the other examples I see from a quick scan.) But just guessing based on what you guess filenames are like without looking around is not going to get you that far.
Just for fun: is there a Python regex that matches all valid Python regexes?
Obviously there's no actual regular expression that matches all regular expressions (you can't handle matched brackets without recursion or some other extension). I think a Python regex to match all actual regular expressions should be pretty easy. But I'm not sure about a Python regex to match all Python regexes. (Although I'll bet if it's possible for Perl, someone has written a CPAN module. Probably without "verbose-mode" whitespace or comments.)
That still wouldn't solve the problem of the many things that are valid regexes and also valid Windows paths (not to mention valid doc strings with embedded code examples, because that includes any possible string…), or detecting things that are obviously intended to be regexes even though they're invalid, etc., so it's probably not very useful for this heuristic anyway. Hence the "just for fun"…
Andrew Barnert via Python-ideas writes:
Just for fun: is there a Python regex that matches all valid Python regexes?
Yes: ‘.*’ matches all valid Python regexes.
Obviously there's no actual regular expression that matches all regular expressions (you can't handle matched brackets without recursion or some other extension).
You seem to be seeking something else: a pattern that matches all valid regex patterns, *and* will never match any string that is not a valid regex pattern. The latter is rather more difficult.
--
“When I was born I was so surprised I couldn't talk for a year and a half.” —Gracie Allen
Ben Finney
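A small sketch of the distinction drawn above, using only the standard `re` module. The helper name `compiles` is invented for this example. It shows that `.*` accepts a valid pattern and an invalid one alike, so it "matches all valid regexes" only in the trivial sense.

```python
import re

# The pattern from the reply above: with fullmatch it accepts any string
# that has no newline, whether or not that string is itself a valid regex.
MATCH_ANYTHING = re.compile(r".*")

def compiles(s):
    """Return True if re.compile accepts s (helper name is illustrative)."""
    try:
        re.compile(s)
    except re.error:
        return False
    return True

valid = r"\d+"
invalid = "("  # unterminated group

for candidate in (valid, invalid):
    print(repr(candidate),
          bool(MATCH_ANYTHING.fullmatch(candidate)),
          compiles(candidate))
```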
Obviously there can't be a regex to exclude everything that isn't a regex.
Parentheses can nest to unlimited depths, so you need a formal grammar. But
virtually everything that is a Windows path is also formally grammatical
regex too (as are many things with no plausible likely intention as such)
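As a rough illustration of why unbounded nesting calls for more than a classical regular expression, the sketch below tracks parenthesis depth with a counter. It is deliberately simplified and is not a regex validator: the function name is made up, and it ignores character classes (where parentheses are literal) and every other part of the syntax.

```python
def parens_balanced(pattern):
    """Check that unescaped parentheses in pattern are balanced.

    Simplified sketch: skips backslash-escaped characters but does not
    understand character classes or any other regex syntax.
    """
    depth = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" with no matching "("
                return False
        i += 1
    return depth == 0

print(parens_balanced("((a)(b))"))
print(parens_balanced("(()"))
```

The counter can grow without bound, which is exactly the state a finite automaton cannot keep.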
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
On Oct 26, 2015, at 14:53, David Mertz wrote:
Obviously there can't be a regex to exclude everything that isn't a regex. Parentheses can nest to unlimited depths, so you need a formal grammar.
As I said:
Obviously there's no actual regular expression that matches all regular expressions (you can't handle matched brackets without recursion or some other extension)
But you can do it trivially with Perl, or with the regex module for Python, e.g., just by sticking a "(?1)" inside a pair of escaped parens plus a negative lookahead or nongreedy repetition. I'm not sure exactly how powerful Python (re module) regexes are (if I want to match something that isn't a regular language, I tend to reach for or build a dedicated parser rather than try to stretch re), but I know they're somewhere between actual regular expressions and Perl regexes.
But virtually everything that is a Windows path is also formally grammatical regex too (as are many things with no plausible likely intention as such)
That's not true. You can, for example, have unclosed brackets or parens in a Windows path. And if you're wondering why anyone would do that, consider MP3 files auto-named based on their ID3v1/FreeDB metadata, which truncates fields at 29 or 30 bytes.
Anyway, as I said in the same message, it wouldn't be a useful heuristic because there's so much overlap, but you don't need to exaggerate that to make the same point.
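The point about unclosed parens can be checked directly with the standard `re` module. The file name below is invented for illustration (a title cut off mid-word, so the "(" is never closed); it is not taken from the thread.

```python
import re

# Hypothetical file name, truncated mid-title so the "(" is never closed.
truncated_name = "Artist - Some Very Long Title (Live at th.mp3"

try:
    re.compile(truncated_name)
    is_valid_regex = True
except re.error as exc:
    is_valid_regex = False
    print("not a regex:", exc)
```

So a name like this would be a legal Windows file name but not a grammatical regex, although such cases are presumably a minority.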
I think we're down to quibbling over the meaning of "virtually" here. I
recognize it's "not all" and you recognize it's "most" Windows paths that
are grammatically regexen.
So is it 80%, 95%, 99.9%? And do we mean "paths found in the wild" or
"paths as systematically enumerated" from possibility space?
On Tue, Oct 27, 2015 at 8:44 AM, Ben Finney wrote:
You seem to be seeking something else: a pattern that matches all valid regex patterns, *and* will never match any string that is not a valid regex pattern. The latter is rather more difficult.
"Rather more difficult" may be an understatement. A regex can contain grouping parentheses which can arbitrarily nest, and matching that with a regex is, AIUI, fundamentally impossible. So I don't think it's possible to have a regex that validates a regex.
Fortunately, it's easy to write a function that validates a regex.
import re

def is_regex(s):
    # Let the re module itself decide whether s is a valid pattern.
    try:
        re.compile(s)
    except re.error:
        return False
    return True
ChrisA
On 27.10.15 04:26, Chris Angelico wrote:
Fortunately, it's easy to write a function that validates a regex.
def is_regex(s):
    try:
        re.compile(s)
    except re.error:
        return False
    return True
re.compile() also can raise OverflowError ('.{,9999999999}') and ValueError ('(?ua)').
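Folding that observation back into the function gives something like the sketch below. Which exception a given pattern raises may differ between Python versions, so the sketch simply catches all three named in this thread; it makes no claim to cover every exception `re.compile()` could raise.

```python
import re

def is_regex(s):
    """Return True if re.compile accepts the string s as a pattern."""
    try:
        re.compile(s)
    except (re.error, OverflowError, ValueError):
        # re.error: ordinary syntax errors such as an unclosed group.
        # OverflowError / ValueError: the cases reported above.
        return False
    return True

for pattern in (r"\d+", "(", ".{,9999999999}", "(?ua)"):
    print(repr(pattern), is_regex(pattern))
```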
On Monday, October 26, 2015 10:26 PM, Chris Angelico wrote:
On Tue, Oct 27, 2015 at 8:44 AM, Ben Finney wrote:
You seem to be seeking something else: a pattern that matches all valid regex patterns, *and* will never match any string that is not a valid regex pattern. The latter is rather more difficult.
"Rather more difficult" may be an understatement...
Indeed, in fact I think that Gödel's Incompleteness Theorem applies here.
Andy Henshaw
On 10/26/2015 3:38 PM, Andrew Barnert via Python-ideas wrote:
Just for fun: is there a Python regex that matches all valid Python regexes?
Could you take this discussion to python-list? The PEP 8 proposal has been rejected.
--
Terry Jan Reedy
participants (9)
- Alexander Walters
- Andrew Barnert
- Ben Finney
- Chris Angelico
- David Mertz
- Henshaw, Andy
- Ned Batchelder
- Serhiy Storchaka
- Terry Reedy