Re: [Python-ideas] PEP 8: raw strings & regular expressions

On 10/22/15 6:56 PM, Yury Selivanov wrote:
proposing is one that you have invented. I have never used R"" strings. I think the best solution to the problem is to improve the highlighters, and luckily you have written one!
To me, it is clear which of these strings is the regex:
    r"\d+"
    r"\dir"
If the highlighters tried some heuristics, they could do a better job "being helpful" by making better guesses about the meaning of programs. I don't mind when highlighters make wrong guesses, as long as they don't ruin the entire rest of the file. But better guesses will be better. :)
--Ned.

On 10/26/15 3:23 AM, Alexander Walters wrote:
I understand developers' penchant for getting everything precisely right and accounting for the darkest of corners and the farthest reaches of obscure edge cases. But I'm talking about making a reasonable guess. If the string contains square brackets, especially paired brackets with hyphens inside, it's probably a regex. --Ned.

On Oct 26, 2015, at 04:33, Ned Batchelder <ned@nedbatchelder.com> wrote:
From working on music tagging software, I can tell you that an awful lot of users have mp3s with square brackets, hyphens, and other such things in their filenames, so if your software makes any assumptions about what filenames look like, their libraries will break your software. And to verify that this isn't some weird artifact of the way people used to name files on piracy networks back when people traded individual songs, I went to The Pirate Bay and checked the most popular current download in any category, and its first file is named: [ www.CpasBien.pw ] Tomorrowland.2015.TRUEFRENCH.BDRip.VxiD-EXTREME.avi So, I don't think you can assume that paired square brackets or hyphens mean something is not a Windows pathname. Of course with a wide enough corpus of filenames people have to deal with, you could come up with a better heuristic. (Not many regexes have character classes that are dotted domain names, or match a standard language code followed by "-sub", or most of the other examples I see from a quick scan.) But just guessing based on what you guess filenames are like without looking around is not going to get you that far.
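For illustration only, here is a minimal sketch of the two bracket heuristics being debated, run against the filename quoted above. The function names and the "Movie [en-sub].avi" filename are made up for this example; nothing here is taken from an actual highlighter.

```python
import re

# Hypothetical heuristics, named here only for illustration.
HAS_BRACKETS = re.compile(r"\[[^\]]*\]")                    # any paired [...]
HAS_HYPHEN_IN_BRACKETS = re.compile(r"\[[^\]]*-[^\]]*\]")   # paired [...] with a hyphen inside


def guess_regex_loose(s):
    """Guess 'regex' if the string contains any paired square brackets."""
    return HAS_BRACKETS.search(s) is not None


def guess_regex_strict(s):
    """Guess 'regex' only if a bracket pair has a hyphen inside it."""
    return HAS_HYPHEN_IN_BRACKETS.search(s) is not None


# Filename quoted in the message above.
filename = "[ www.CpasBien.pw ] Tomorrowland.2015.TRUEFRENCH.BDRip.VxiD-EXTREME.avi"
# Made-up filename in the "language code followed by -sub" style.
subtitled = "Movie [en-sub].avi"

for candidate in (r"[a-z]+", filename, subtitled):
    print(candidate, guess_regex_loose(candidate), guess_regex_strict(candidate))
```

The loose rule flags both filenames as regexes. The strict rule lets the first filename through, because its hyphen sits outside the brackets, but still misfires on the "-sub" style name.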

On Oct 26, 2015, at 04:55, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Just for fun: is there a Python regex that matches all valid Python regexes? Obviously there's no actual regular expression that matches all regular expressions (you can't handle matched brackets without recursion or some other extension). I think a Python regex to match all actual regular expressions should be pretty easy. But I'm not sure about a Python regex to match all Python regexes. (Although I'll bet if it's possible for perl, someone has written a CPAN module. Probably without "verbose-mode" whitespace or comments.) That still wouldn't solve the problem of the many things that are valid regexes and also valid Windows paths (not to mention valid doc strings with embedded code examples, because that includes any possible string…), or detecting things that are obviously intended to be regexes even though they're invalid, etc., so it's probably not very useful for this heuristic anyway. Hence the "just for fun"…

Andrew Barnert via Python-ideas <python-ideas@python.org> writes:
Just for fun: is there a Python regex that matches all valid Python regexes?
Yes: ‘.*’ matches all valid Python regexes.
You seem to be seeking something else: a pattern that matches all valid regex patterns, *and* will never match any string that is not a valid regex pattern. The latter is rather more difficult.
-- 
“When I was born I was so surprised I couldn't talk for a year and a half.” -- Gracie Allen
Ben Finney
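A small sketch of the distinction drawn above: a match-anything pattern accepts every valid regex, but it accepts invalid ones just as readily. The re.DOTALL flag is an addition here (without it, "." does not match a newline), and the sample strings are made up.

```python
import re

# Matches any string at all; DOTALL lets "." cover newlines too.
match_anything = re.compile(r".*", re.DOTALL)

samples = [r"\d+", r"[a-z]+", "(unclosed", "[ also unclosed"]

results = []
for s in samples:
    matched = match_anything.fullmatch(s) is not None
    try:
        re.compile(s)
        valid = True
    except re.error:
        valid = False
    results.append((s, matched, valid))

for s, matched, valid in results:
    print(repr(s), "matched:", matched, "valid regex:", valid)
```

Every sample is matched, including the two that re.compile rejects, so the pattern tells you nothing about validity.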

Obviously there can't be a regex to exclude everything that isn't a regex. Parentheses can nest to unlimited depths, so you need a formal grammar. But virtually everything that is a Windows path is also formally grammatical regex too (as are many things with no plausible likely intention as such)
On Oct 26, 2015 2:44 PM, "Ben Finney" <ben+python@benfinney.id.au> wrote:

On Oct 26, 2015, at 14:53, David Mertz <mertz@gnosis.cx> wrote:
Obviously there can't be a regex to exclude everything that isn't a regex. Parentheses can nest to unlimited depths, so you need a formal grammar.
As I said:
Obviously there's no actual regular expression that matches all regular expressions (you can't handle matched brackets without recursion or some other extension)
But you can do it trivially with Perl, or with the Regex module for Python, e.g., just by sticking a "(?1)" inside a pair of escaped parens plus a negative lookahead or nongreedy repetition. I'm not sure exactly how powerful Python (re module) regexes are (if I want to match something that isn't a regular language, I tend to reach for or build a dedicated parser rather than try to stretch re), but I know they're somewhere between actual regular expressions and perl regexes.
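As a rough illustration of the point about the stdlib re module: it rejects the "(?1)" recursion syntax at compile time, so nested parentheses are usually checked with a small hand-written scanner instead. The sketch below is deliberately simplified. The name parens_balanced is made up, and the scanner only tracks backslash escapes and parentheses; it ignores character classes such as "[(]".

```python
import re

# The recursive-group syntax is not accepted by the stdlib re module.
try:
    re.compile(r"(\((?:[^()]|(?1))*\))")
    re_supports_recursion = True
except re.error:
    re_supports_recursion = False


def parens_balanced(s):
    """Return True if unescaped parentheses in s are balanced.

    Simplification: character classes are not handled.
    """
    depth = 0
    escaped = False
    for ch in s:
        if escaped:
            escaped = False          # this character was escaped; skip it
        elif ch == "\\":
            escaped = True           # next character is escaped
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:            # closing paren with no opener
                return False
    return depth == 0


print(re_supports_recursion)
print(parens_balanced("(a(b)c)"), parens_balanced("(a(b)c"))
```

A counter is enough here because only one kind of bracket is tracked. A fuller checker would need the rest of the regex grammar, which is the job re.compile already does.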
But virtually everything that is a Windows path is also formally grammatical regex too (as are many things with no plausible likely intention as such)
That's not true. You can, for example, have unclosed brackets or parens in a Windows path. And if you're wondering why anyone would do that, consider MP3 files auto-named based on their ID3v1/FreeDB metadata, which truncates fields at 29 or 30 bytes. Anyway, as I said in the same message, it wouldn't be a useful heuristic because there's so much overlap, but you don't need to exaggerate that to make the same point.
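To make the overlap concrete, here is a quick check that uses re.compile as the validity test. The helper name compiles_as_regex and the two truncated "Song" names are made up for this sketch; the other strings are quoted earlier in the thread.

```python
import re


def compiles_as_regex(s):
    """Return True if the stdlib re module accepts s as a pattern."""
    try:
        re.compile(s)
    except re.error:
        return False
    return True


candidates = [
    r"\dir",  # from the first message: a path fragment that is also a valid regex
    "[ www.CpasBien.pw ] Tomorrowland.2015.TRUEFRENCH.BDRip.VxiD-EXTREME.avi",
    "Song [Live",  # made-up truncated name with an unclosed bracket
    "Song (Live",  # made-up truncated name with an unclosed paren
]

for name in candidates:
    print(repr(name), compiles_as_regex(name))
```

The first two strings compile, so validity alone cannot separate regexes from filenames. The truncated names do not compile, which is the counterexample to "virtually everything".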

I think we're down to quibbling over the meaning of "virtually" here. I recognize it's "not all" and you recognize it's "most" windows paths are grammatically regexen. So is it 80%, 95%, 99.9%? And do we mean "paths found in the wild" or "paths as systematically enumerated" from possibility space?
On Oct 26, 2015 3:23 PM, "Andrew Barnert" <abarnert@yahoo.com> wrote:

On Tue, Oct 27, 2015 at 8:44 AM, Ben Finney <ben+python@benfinney.id.au> wrote:
"Rather more difficult" may be an understatement. A regex can contain grouping parentheses which can arbitrarily nest, and matching that with a regex is, AIUI, fundamentally impossible. So I don't think it's possible to have a regex that validates a regex. Fortunately, it's easy to write a function that validates a regex.
    import re
    def is_regex(s):
        try:
            re.compile(s)
        except re.error:
            return False
        return True
ChrisA

On 10/26/2015 3:38 PM, Andrew Barnert via Python-ideas wrote:
Just for fun: is there a Python regex that matches all valid Python regexes?
Could you take this discussion to python-list. The PEP8 proposal has been rejected. -- Terry Jan Reedy

participants (9)
- Alexander Walters
- Andrew Barnert
- Ben Finney
- Chris Angelico
- David Mertz
- Henshaw, Andy
- Ned Batchelder
- Serhiy Storchaka
- Terry Reedy