[Python-ideas] What about regexp string litterals : re".*" ?
Neil Girdhar
mistersheik at gmail.com
Sun Apr 2 21:22:08 EDT 2017
Same. One day, Python will have a decent parsing library.
On Friday, March 31, 2017 at 4:21:51 AM UTC-4, Stephan Houben wrote:
>
> Hi all,
>
> FWIW, I also strongly prefer the Verbal Expression style and consider
> "normal" regular expressions to become quickly unreadable and
> unmaintainable.
>
> Verbal Expressions are also much more composable.
>
> Stephan
>
> 2017-03-31 9:23 GMT+02:00 Stephen J. Turnbull
> <turnbull.... at u.tsukuba.ac.jp>:
> > Abe Dillon writes:
> >
> > > Note that the entire documentation is 250 words while just the syntax
> > > portion of Python docs for the re module is over 3000 words.
> >
> > Since Verbal Expressions (below, VEs, indicating notation) "compile"
> > to regular expressions (spelling out indicates the internal matching
> > implementation), the documentation of VEs presumably ignores
> > everything except the limited language it's useful for. To actually
> > understand VEs, you need to refer to the RE docs. Not a win IMO.
> >
> > > > You think that example is more readable than the proposed
> > > > translation
> > > > ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$
> > > > which is better written
> > > > ^https?://(www\.)?[^ ]*$
> > > > or even
> > > > ^https?://[^ ]*$
> > >
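A quick check, not part of the original exchange, that the three spellings quoted above accept the same strings. The sample URLs are made up for illustration; only the patterns come from the thread.

```python
import re

# The three spellings quoted above, written as raw strings.
VERBOSE = re.compile(r"^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$")
MIDDLE = re.compile(r"^https?://(www\.)?[^ ]*$")
SHORT = re.compile(r"^https?://[^ ]*$")

# Hypothetical sample inputs, chosen only for illustration.
samples = [
    "http://example.com",
    "https://www.example.com/path?q=1",
    "ftp://example.com",
    "https://example.com with space",
    "see https://example.com",
]


def verdicts(pattern):
    """Return, for each sample, whether the pattern matches it."""
    return [bool(pattern.match(s)) for s in samples]


# All three patterns agree on every sample; only the capture groups differ.
print(verdicts(VERBOSE) == verdicts(MIDDLE) == verdicts(SHORT))
print(verdicts(SHORT))
```

The optional (www\.)? adds nothing to the set of matched strings because [^ ]* already covers it; it only matters if you want the group.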
> > >
> > > Yes. I find it *far* more readable. It's not a soup of symbols like
> > > Perl code. I can only surmise that you're fluent in regex because it
> > > seems difficult for you to see how the above could be less readable
> > > than English words.
> >
> > Yes, I'm fairly fluent in regular expression notation (below, REs).
> > I've maintained a compiler for one dialect.
> >
> > I'm not interested in the difference between words and punctuation
> > though. The reason I find the middle RE most readable is that it
> > "looks like" what it's supposed to match, in a contiguous string as
> > the object it will match will be contiguous. If I need to parse it to
> > figure out *exactly* what it matches, yes, that takes more effort.
> > But to understand a VE's semantics correctly, I'd have to look it up
> > as often as you have to look up REs because many words chosen to notate
> > VEs have English meanings that are (a) ambiguous, as in all natural
> > language, and (b) only approximate matches to RE semantics.
> >
> > > I could tell it only matches URLs that are the only thing inside
> > > the string because it clearly says: start_of_line() and
> > > end_of_line().
> >
> > That's not the problem. The problem is the semantics of the method
> > "find". "then" would indeed read better, although it doesn't exactly
> > match the semantics of concatenation in REs.
> >
> > > I would have had to refer to a reference to know that "^" doesn't
> > > always mean "not", it sometimes means "start of string" and
> > > probably other things. I would also have to check a reference to
> > > know that "$" can mean "end of string" (and probably other things).
> >
> > And you'll still have to do that when reading other people's REs.
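For readers who do have to look these up, a small sketch of the meanings of "^" and "$" being discussed, using Python's re module. The sample text is invented for illustration.

```python
import re

text = "first line\nsecond line"

# Outside brackets, "^" anchors at the start of the string.
starts = re.findall(r"^\w+", text)

# With re.MULTILINE, "^" also matches right after each newline.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Inside brackets, a leading "^" negates the character class ("not").
non_vowels = re.findall(r"[^aeiou\s]", "regex")

# "$" anchors at the end of the string (or just before a final newline).
ends = re.findall(r"\w+$", text)

print(starts, line_starts, non_vowels, ends)
```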
> >
> > > > Are those groups capturing in Verbal Expressions? The use of
> > > > "find" (~ "search") rather than "match" is disconcerting to the
> > > > experienced user.
> > >
> > > You can alternately use the word "then". The source code is just
> > > one python file. It's very easy to read. I actually like "then"
> > > over "find" for the example:
> >
> > You're missing the point. The reader does not get to choose the
> > notation, the author does. I do understand what several varieties of
> > RE mean, but the variations are of two kinds: basic versus extended
> > (ie, what tokens need to be escaped to be taken literally, which ones
> > have special meaning if escaped), and extensions (which can be
> > ignored). Modern RE facilities are essentially all of the extended
> > variety. Once you've learned that, you're in good shape for almost
> > any RE that should be written outside of an obfuscated code contest.
> >
> > This is a fundamental principle of Python design: don't make readers
> > of code learn new things. That includes using notation developed
> > elsewhere in many cases.
> >
> > > What does alternation look like?
> > >
> > > .OR(option1).OR(option2).OR(option3)...
> > >
> > > How about alternation of
> > > > non-trivial regular expressions?
> > >
> > > .OR(other_verbal_expression)
> >
> > Real examples, rather than pseudo code, would be nice. I think you,
> > too, will find that examples of even fairly simple nested alternations
> > containing other constructs become quite hard to read, as they fall
> > off the bottom of the screen.
> >
> > For example, the VE equivalent of
> >
> > scheme = "(https?|ftp|file):"
> >
> > would be (AFAICT):
> >
> > scheme = (VerEx().then(VerEx().then("http")
> >                               .maybe("s")
> >                               .OR("ftp")
> >                               .OR("file"))
> >                  .then(":"))
> >
> > which is pretty hideous, I think. And the colon is captured by a
> > group. If perversely I wanted to extract that group from a match,
> > what would its index be?
> >
> > I guess you could keep the linear arrangement with
> >
> > scheme = (VerEx().add("(")
> >                  .then("http")
> >                  .maybe("s")
> >                  .OR("ftp")
> >                  .OR("file")
> >                  .add(")")
> >                  .then(":"))
> >
> > but is that really an improvement over
> >
> > scheme = VerEx().add("(https?|ftp|file):")
> >
> > ;-)
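On the group-index question: with the plain RE, the index is found by counting opening parentheses. A minimal sketch with the standard re module only (the VE library is not used here, and the URLs are made up):

```python
import re

# The scheme pattern from the message above; one pair of parentheses,
# so the alternation is group 1 and the colon is outside any group.
scheme = re.compile(r"(https?|ftp|file):")

m = scheme.match("https://www.python.org")
whole = m.group(0)   # the entire match, colon included
name = m.group(1)    # only what the parenthesized alternation matched

# A named group avoids counting parentheses altogether.
named = re.compile(r"(?P<scheme>https?|ftp|file):")
by_name = named.match("file:///tmp/notes.txt").group("scheme")

print(whole, name, by_name)
```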
> >
> > > > As far as I can see, Verbal Expressions are basically a way of
> > > > making it so painful to write regular expressions that people
> > > > will restrict themselves to regular expressions
> > >
> > > What's so painful to write about them?
> >
> > One thing that's painful is that VEs "look like" context-free
> > grammars, but clumsy and without the powerful semantics. You can get
> > the readability you want with greater power using grammars, which is
> > why I would prefer we work on getting a parser module into the stdlib.
> >
> > But if one doesn't know about grammars, it's still not great. The
> > main pains about writing VEs for me are (1) reading what I just wrote,
> > (2) accessing capturing groups, and (3) verbosity. Even a VE to
> > accurately match what is normally a fairly short string, such as the
> > scheme, credentials, authority, and port portions of a "standard" URL,
> > is going to be hundreds of characters long and likely dozens of lines
> > if folded as in the examples.
> >
> > Another issue is that we already have a perfectly good poor man's
> > matching library: glob. The URL example becomes
> >
> > http{,s}://{,www.}*
> >
> > Granted you lose the anchors, but how often does that matter? You
> > apparently don't use them often enough to remember them.
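One caveat if you try this from Python: the stdlib fnmatch and glob modules handle *, ? and [seq] but do not expand shell-style braces such as {,s}. A rough sketch, with the alternatives expanded by hand; the function name looks_like_url is hypothetical.

```python
from fnmatch import fnmatchcase

# Brace alternatives expanded by hand, because fnmatch has no {a,b} syntax.
# The "www." variants are omitted: "*" already covers them.
PATTERNS = ["http://*", "https://*"]


def looks_like_url(text):
    """Return True if text matches any of the glob patterns."""
    return any(fnmatchcase(text, pattern) for pattern in PATTERNS)


print(looks_like_url("https://www.example.com"))
print(looks_like_url("ftp://example.com"))
```

Note that "*" in fnmatch also matches spaces, so this is looser than the [^ ]* in the REs above.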
> >
> > > Does your IDE not have autocompletion?
> >
> > I don't want an IDE. I have Emacs.
> >
> > > I find REs so painful to write that I usually just use string
> > > methods if at all feasible.
> >
> > Guess what? That's the right thing to do anyway. They're a lot more
> > readable and efficient when partitioning a string into two or three
> > parts, or recognizing a short list of affixes. But chaining many
> > methods, as VEs do, is not a very Pythonic way to write a program.
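A small sketch of the string-method approach for exactly these cases: splitting into two or three parts, and checking a short list of affixes. The URL is made up.

```python
url = "https://www.example.com/index.html"

# partition() splits on the first occurrence of the separator and always
# returns three parts: before, separator, after.
scheme, sep, rest = url.partition("://")

# startswith() and endswith() accept a tuple of alternatives.
is_web = url.startswith(("http://", "https://"))
is_html = url.endswith((".html", ".htm"))

print(scheme, rest, is_web, is_html)
```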
> >
> > > > I don't think that this failure to respect the developer's taste
> > > > is restricted to this particular implementation, either.
> > >
> > > I generally find it distasteful to write a pseudolanguage in
> > > strings inside of other languages (this applies to SQL as well).
> >
> > You mean like arithmetic operators? (Lisp does this right, right?
> > Only one kind of expression, the function call!) It's a matter of
> > what you're used to. I understand that people new to text-processing,
> > or who don't do so much of it, don't find REs easy to read. So how is
> > this a huge loss? They don't use regular expressions very often! In
> > fact, they're far more likely to encounter, and possibly need to
> > understand, REs written by others!
> >
> > > Especially when the design principles of that pseudolanguage are
> > > *diametrically opposed* to the design principles of the host
> > > language. A key principle of Python's design is: "you read code a
> > > lot more often than you write code, so emphasize
> > > readability". Regex seems to be based on: "Do the most with the
> > > fewest key-strokes.
> >
> > So is all of mathematics. There's nothing wrong with concise
> > expression for use in special cases.
> >
> > > Readability be damned!". It makes a lot more sense to wrap the
> > > pseudolanguage in constructs that bring it in-line with the host
> > > language than to take on the mental burden of trying to comprehend
> > > two different languages at the same time.
> > >
> > > If you disagree, nothing's stopping you from continuing to write
> > > REs the old-fashioned way.
> >
> > I don't think that RE and SQL are "pseudo" languages, no. And I, and
> > most developers, will continue to write regular expressions using the
> > much more compact and expressive RE notation. (In fact with the
> > exception of the "word" method, in VEs you still need to use RE notation
> > to express most of the Python extensions.) So what you're saying is
> > that you don't read much code, except maybe your own. Isn't that your
> > problem? Those of us who cooperate widely on applications using
> > regular expressions will continue to communicate using REs. If that
> > leaves you out, that's not good. But adding VEs to the stdlib (and
> > thus encouraging their use) will split the community into RE users and
> > VE users, if VEs are at all useful. That's bad. I don't see the
> > potential usefulness of VEs to infrequent users of regular
> > expressions outweighing the downsides of "many ways to do it" in the
> > stdlib.
> >
> > > Can we at least agree that baking special re syntax directly into
> > > the language is a bad idea?
> >
> > I agree that there's no particular need for RE literals. If one wants
> > to mark an RE as some special kind of object, re.compile() does that
> > very well both by converting to a different type internally and as a
> > marker syntactically.
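To illustrate the point about re.compile() acting as the marker, a minimal sketch; the pattern is the short URL RE from earlier in the thread and the test strings are invented.

```python
import re

# The call to re.compile() tells the reader this string is an RE, and the
# result is a compiled pattern object rather than a str.
URL = re.compile(r"^https?://[^ ]*$")

is_plain_string = isinstance(URL, str)   # False: it is a distinct type
source = URL.pattern                     # the original RE text is kept

# Matching methods live on the compiled object.
hit = URL.match("http://python.org")
miss = URL.match("python.org")

print(is_plain_string, source, bool(hit), bool(miss))
```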
> >
> > > On Wed, Mar 29, 2017 at 11:49 PM, Nick Coghlan <ncog... at gmail.com>
> > > wrote:
> > >
> > > > We don't really want to ease the use of regexps in Python - while
> > > > they're an incredibly useful tool in a programmer's toolkit,
> > > > they're so cryptic that they're almost inevitably a
> > > > maintainability nightmare.
> >
> > I agree with Nick. Regular expressions, whatever the notation, are a
> > useful tool (no suspension of disbelief necessary for me, though!).
> > But they are cryptic, and it's not just the notation. People (even
> > experienced RE users) are often surprised by what fairly simple
> > regular expressions match in a given text, because people want to read
> > a regexp as instructions to a one-pass greedy parser, and it isn't.
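Two small cases of the surprise described above, sketched with Python's re module. The inputs are invented; the behaviour shown is ordinary backtracking.

```python
import re

# A one-pass greedy reader would let \w+ swallow "abc123" and then fail,
# since nothing is left for \d+. The engine instead backtracks by one
# character so that the second group can match.
split = re.match(r"(\w+)(\d+)", "abc123").groups()
print(split)  # ('abc12', '3')

# Greedy ".*" runs to the last ">", not the first one.
tag = re.search(r"<.*>", "<a> and <b>").group()
print(tag)  # <a> and <b>
```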
> >
> > For example, above I wrote
> >
> > scheme = "(https?|ftp|file):"
> >
> > rather than
> >
> > scheme = "(\w+):"
> >
> > because it's not unlikely that I would want to treat those differently
> > from other schemes such as mailto, news, and doi. In many
> > applications of regular expressions (such as tokenization for a
> > parser) you need many expressions. Compactness really is a virtue in
> > REs.
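A sketch of why the explicit alternation is useful when several schemes need different treatment. The group names web and other and the function classify are hypothetical, chosen for illustration.

```python
import re

# Known schemes get their own named group; anything else that looks like
# a scheme falls through to the generic \w+ alternative.
SCHEME = re.compile(r"(?P<web>https?|ftp|file):|(?P<other>\w+):")


def classify(url):
    """Return 'web', 'other', or None if no scheme is found at the start."""
    m = SCHEME.match(url)
    # lastgroup is the name of the last capturing group that matched.
    return m.lastgroup if m else None


for url in ("https://python.org", "mailto:someone", "no scheme here"):
    print(url, "->", classify(url))
```

A real tokenizer would carry many such alternatives, which is where the compactness of RE notation pays off.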
> >
> > Steve
> >
> > _______________________________________________
> > Python-ideas mailing list
> > Python... at python.org
> > https://mail.python.org/mailman/listinfo/python-ideas
> > Code of Conduct: http://python.org/psf/codeofconduct/