
Abe Dillon writes:
Note that the entire documentation is 250 words while just the syntax portion of Python docs for the re module is over 3000 words.
Since Verbal Expressions (below, VEs, indicating notation) "compile" to regular expressions (spelling out indicates the internal matching implementation), the documentation of VEs presumably ignores everything except the limited language it's useful for. To actually understand VEs, you need to refer to the RE docs. Not a win IMO.
You think that example is more readable than the proposed translation ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$ which is better written ^https?://(www\.)?[^ ]*$ or even ^https?://[^ ]*$
Yes. I find it *far* more readable. It's not a soup of symbols like Perl code. I can only surmise that you're fluent in regex because it seems difficult for you to see how the above could be less readable than English words.
Yes, I'm fairly fluent in regular expression notation (below, REs). I've maintained a compiler for one dialect. I'm not interested in the difference between words and punctuation though. The reason I find the middle RE most readable is that it "looks like" what it's supposed to match, in a contiguous string as the object it will match will be contiguous. If I need to parse it to figure out *exactly* what it matches, yes, that takes more effort. But to understand a VE's semantics correctly, I'd have to look it up as often as you have to look up REs because many words chosen to notate VEs have English meanings that are (a) ambiguous, as in all natural language, and (b) only approximate matches to RE semantics.
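Concretely, all three spellings of the pattern accept the same strings; a quick sanity check (the sample URLs are my own, not from the thread):

```python
import re

# The three candidate patterns quoted above, anchored with ^ and $:
verbose = r"^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$"
cleaner = r"^https?://(www\.)?[^ ]*$"
minimal = r"^https?://[^ ]*$"

for pat in (verbose, cleaner, minimal):
    assert re.search(pat, "https://www.example.com/path")
    assert re.search(pat, "http://example.com")
    assert not re.search(pat, "ftp://example.com")
    # The anchors mean the URL must be the *entire* string:
    assert not re.search(pat, "see https://example.com here")
```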
I could tell it only matches URLs that are the only thing inside the string because it clearly says: start_of_line() and end_of_line().
That's not the problem. The problem is the semantics of the method "find". "then" would indeed read better, although it doesn't exactly match the semantics of concatenation in REs.
I would have had to refer to a reference to know that "^" doesn't always mean "not", it sometimes means "start of string" and probably other things. I would also have to check a reference to know that "$" can mean "end of string" (and probably other things).
And you'll still have to do that when reading other people's REs.
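To illustrate the overloading in question (examples mine): "^" is an anchor at the top level but a negation inside a character class, and "$" anchors at the end of the string.

```python
import re

# '^' at the start of a pattern anchors at the beginning...
assert re.search(r"^abc", "abcdef")
assert not re.search(r"^abc", "xabc")

# ...but inside a character class it negates the set:
assert re.findall(r"[^0-9]+", "ab12cd") == ["ab", "cd"]

# '$' anchors at the end of the string (and also matches just
# before a trailing newline):
assert re.search(r"end$", "the end")
assert re.search(r"end$", "the end\n")
```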
Are those groups capturing in Verbal Expressions? The use of "find" (~ "search") rather than "match" is disconcerting to the experienced user.
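The distinction matters in Python's re module, where match() is implicitly anchored at the start while search() scans the whole string, much as "find" suggests; a small sketch (pattern and sample strings are my own):

```python
import re

pat = re.compile(r"https?://\S+")

# match() only succeeds at the very start of the string...
assert pat.match("http://example.com/x")
assert pat.match("see http://example.com") is None

# ...while search() scans forward, which is what "find" connotes:
m = pat.search("see http://example.com")
assert m and m.group() == "http://example.com"
```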
You can alternately use the word "then". The source code is just one python file. It's very easy to read. I actually like "then" over "find" for the example:
You're missing the point. The reader does not get to choose the notation, the author does. I do understand what several varieties of RE mean, but the variations are of two kinds: basic versus extended (i.e., which tokens need to be escaped to be taken literally, and which ones have special meaning if escaped), and extensions (which can be ignored). Modern RE facilities are essentially all of the extended variety. Once you've learned that, you're in good shape for almost any RE that should be written outside of an obfuscated code contest. This is a fundamental principle of Python design: don't make readers of code learn new things. That includes using notation developed elsewhere in many cases.
What does alternation look like?
.OR(option1).OR(option2).OR(option3)...
How about alternation of
non-trivial regular expressions?
.OR(other_verbal_expression)
Real examples, rather than pseudo code, would be nice. I think you, too, will find that examples of even fairly simple nested alternations containing other constructs become quite hard to read, as they fall off the bottom of the screen. For example, the VE equivalent of

    scheme = "(https?|ftp|file):"

would be (AFAICT):

    scheme = VerEx().then(VerEx().then("http")
                                 .maybe("s")
                                 .OR("ftp")
                                 .OR("file"))
                    .then(":")

which is pretty hideous, I think. And the colon is captured by a group. If perversely I wanted to extract that group from a match, what would its index be? I guess you could keep the linear arrangement with

    scheme = (VerEx().add("(")
                     .then("http")
                     .maybe("s")
                     .OR("ftp")
                     .OR("file")
                     .add(")")
                     .then(":"))

but is that really an improvement over

    scheme = VerEx().add("(https?|ftp|file):")

;-)
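For the plain-RE spelling, at least, the group-index question has a straightforward answer; a quick check (the example URL is mine):

```python
import re

scheme = "(https?|ftp|file):"
m = re.match(scheme, "https://example.com")
assert m.group(0) == "https:"   # the whole match, colon included
assert m.group(1) == "https"    # the first (and only) capturing group
```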
As far as I can see, Verbal Expressions are basically a way of making it so painful to write regular expressions that people will restrict themselves to regular expressions
What's so painful to write about them?
One thing that's painful is that VEs "look like" context-free grammars, but clumsy and without the powerful semantics. You can get the readability you want with greater power using grammars, which is why I would prefer we work on getting a parser module into the stdlib. But if one doesn't know about grammars, it's still not great. The main pains about writing VEs for me are (1) reading what I just wrote, (2) accessing capturing groups, and (3) verbosity. Even a VE to accurately match what is normally a fairly short string, such as the scheme, credentials, authority, and port portions of a "standard" URL, is going to be hundreds of characters long and likely dozens of lines if folded as in the examples.

Another issue is that we already have a perfectly good poor man's matching library: glob. The URL example becomes

    http{,s}://{,www.}*

Granted you lose the anchors, but how often does that matter? You apparently don't use them often enough to remember them.
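One caveat: the {,s} brace expansion is a shell-glob feature, and Python's own fnmatch module knows only *, ?, and [seq], so a stdlib sketch of the same idea has to be coarser (sample URLs are mine):

```python
from fnmatch import fnmatch

# Python's fnmatch has no brace expansion, so this is a rougher
# approximation of http{,s}://{,www.}* than the shell pattern:
pattern = "http*://*"
assert fnmatch("https://www.example.com/x", pattern)
assert fnmatch("http://example.com", pattern)
assert not fnmatch("ftp://example.com", pattern)
```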
Does your IDE not have autocompletion?
I don't want an IDE. I have Emacs.
I find REs so painful to write that I usually just use string methods if at all feasible.
Guess what? That's the right thing to do anyway. They're a lot more readable and efficient when partitioning a string into two or three parts, or recognizing a short list of affixes. But chaining many methods, as VEs do, is not a very Pythonic way to write a program.
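For instance, partitioning and affix checks read directly off the string methods; a sketch with a sample string of my own:

```python
url = "https://example.com/path"

# Partitioning a string into two or three parts:
scheme, sep, rest = url.partition("://")
assert sep and scheme == "https" and rest == "example.com/path"

# Recognizing a short list of affixes (startswith accepts a tuple):
assert url.startswith(("http://", "https://", "ftp://"))
assert not url.startswith("mailto:")
```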
I don't think that this failure to respect the developer's taste is restricted to this particular implementation, either.
I generally find it distasteful to write a pseudolanguage in strings inside of other languages (this applies to SQL as well).
You mean like arithmetic operators? (Lisp does this right, right? Only one kind of expression, the function call!) It's a matter of what you're used to. I understand that people new to text-processing, or who don't do so much of it, don't find REs easy to read. So how is this a huge loss? They don't use regular expressions very often! In fact, they're far more likely to encounter, and possibly need to understand, REs written by others!
Especially when the design principles of that pseudolanguage are *diametrically opposed* to the design principles of the host language. A key principle of Python's design is: "you read code a lot more often than you write code, so emphasize readability". Regex seems to be based on: "Do the most with the fewest key-strokes.
So is all of mathematics. There's nothing wrong with concise expression for use in special cases.
Readability be damned!". It makes a lot more sense to wrap the pseudolanguage in constructs that bring it in line with the host language than to take on the mental burden of trying to comprehend two different languages at the same time.
If you disagree, nothing's stopping you from continuing to write REs the old-fashioned way.
I don't think that RE and SQL are "pseudo" languages, no. And I, and most developers, will continue to write regular expressions using the much more compact and expressive RE notation. (In fact, with the exception of the "word" method, in VEs you still need to use RE notation to express most of the Python extensions.)

So what you're saying is that you don't read much code, except maybe your own. Isn't that your problem? Those of us who cooperate widely on applications using regular expressions will continue to communicate using REs. If that leaves you out, that's not good. But adding VEs to the stdlib (and thus encouraging their use) will split the community into RE users and VE users, if VEs are at all useful. That's bad. I don't see the potential usefulness of VEs to infrequent users of regular expressions outweighing the downsides of "many ways to do it" in the stdlib.
Can we at least agree that baking special re syntax directly into the language is a bad idea?
I agree that there's no particular need for RE literals. If one wants to mark an RE as some special kind of object, re.compile() does that very well both by converting to a different type internally and as a marker syntactically.
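A minimal sketch of that marker role (the names are mine):

```python
import re

URL_RE = re.compile(r"https?://\S+")

# The compiled pattern is a distinct type, so it already serves as
# both a syntactic and a runtime marker for "this string is an RE":
assert isinstance(URL_RE, re.Pattern)
assert URL_RE.search("see http://example.com")
```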
On Wed, Mar 29, 2017 at 11:49 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
We don't really want to ease the use of regexps in Python - while they're an incredibly useful tool in a programmer's toolkit, they're so cryptic that they're almost inevitably a maintainability nightmare.
I agree with Nick. Regular expressions, whatever the notation, are a useful tool (no suspension of disbelief necessary for me, though!). But they are cryptic, and it's not just the notation. People (even experienced RE users) are often surprised by what fairly simple regular expressions match in a given text, because people want to read a regexp as instructions to a one-pass greedy parser, and it isn't.

For example, above I wrote

    scheme = "(https?|ftp|file):"

rather than

    scheme = "(\w+):"

because it's not unlikely that I would want to treat those differently from other schemes such as mailto, news, and doi. In many applications of regular expressions (such as tokenization for a parser) you need many expressions. Compactness really is a virtue in REs.

Steve