[Python-ideas] What about regexp string litterals : re".*" ?

Neil Girdhar mistersheik at gmail.com
Sun Apr 2 21:22:08 EDT 2017


Same.  One day, Python will have a decent parsing library.

On Friday, March 31, 2017 at 4:21:51 AM UTC-4, Stephan Houben wrote:
>
> Hi all, 
>
> FWIW, I also strongly prefer the Verbal Expression style and consider 
> "normal" regular expressions to become quickly unreadable and 
> unmaintainable. 
>
> Verbal Expressions are also much more composable. 
>
> Stephan 
>
> 2017-03-31 9:23 GMT+02:00 Stephen J. Turnbull 
> <turnbull.... at u.tsukuba.ac.jp>:
> > Abe Dillon writes: 
> > 
> >  > Note that the entire documentation is 250 words while just the syntax 
> >  > portion of Python docs for the re module is over 3000 words. 
> > 
> > Since Verbal Expressions (below, VEs, indicating notation) "compile" 
> > to regular expressions (spelling out indicates the internal matching 
> > implementation), the documentation of VEs presumably ignores 
> > everything except the limited language it's useful for.  To actually 
> > understand VEs, you need to refer to the RE docs.  Not a win IMO. 
> > 
> >  > > You think that example is more readable than the proposed
> >  > > translation
> >  > >     ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$ 
> >  > > which is better written 
> >  > >     ^https?://(www\.)?[^ ]*$ 
> >  > > or even 
> >  > >     ^https?://[^ ]*$ 
> >  > 
> >  > 
> >  > Yes. I find it *far* more readable. It's not a soup of symbols like
> >  > Perl code. I can only surmise that you're fluent in regex because it
> >  > seems difficult for you to see how the above could be less readable
> >  > than English words.
> > 
> > Yes, I'm fairly fluent in regular expression notation (below, REs). 
> > I've maintained a compiler for one dialect. 
> > 
> > I'm not interested in the difference between words and punctuation 
> > though.  The reason I find the middle RE most readable is that it 
> > "looks like" what it's supposed to match, in a contiguous string as 
> > the object it will match will be contiguous.  If I need to parse it to 
> > figure out *exactly* what it matches, yes, that takes more effort. 
> > But to understand a VE's semantics correctly, I'd have to look it up 
> > as often as you have to look up REs because many words chosen to notate 
> > VEs have English meanings that are (a) ambiguous, as in all natural 
> > language, and (b) only approximate matches to RE semantics. 
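
For anyone who wants to check the three URL patterns quoted above against
real strings, here is a minimal sketch using the stdlib re module. The
patterns are copied from the message; the sample strings and all names in
the code are made up for illustration.

```python
import re

# The three patterns from the thread, most verbose first.
PATTERNS = {
    "translated": r"^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$",
    "middle": r"^https?://(www\.)?[^ ]*$",
    "short": r"^https?://[^ ]*$",
}

# Hypothetical sample inputs, chosen only for illustration.
SAMPLES = [
    "https://www.python.org/",
    "http://example.org",
    "ftp://example.org",
    "https://example.org/a b",
]

def verdicts(text):
    """Return {pattern name: True/False} for one input string."""
    return {name: re.match(p, text) is not None for name, p in PATTERNS.items()}

for sample in SAMPLES:
    print(sample, verdicts(sample))

# Number of capturing groups in each pattern.
print({name: re.compile(p).groups for name, p in PATTERNS.items()})
```

On these samples the three patterns accept and reject the same strings.
They differ only in how many capturing groups they define, which matters
if the caller wants to pull pieces out of the match.
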
> > 
> >  > I could tell it only matches URLs that are the only thing inside 
> >  > the string because it clearly says: start_of_line() and 
> >  > end_of_line(). 
> > 
> > That's not the problem.  The problem is the semantics of the method 
> > "find".  "then" would indeed read better, although it doesn't exactly 
> > match the semantics of concatenation in REs. 
> > 
> >  > I would have had to refer to a reference to know that "^" doesn't 
> >  > always mean "not", it sometimes means "start of string" and 
> >  > probably other things. I would also have to check a reference to 
> >  > know that "$" can mean "end of string" (and probably other things). 
> > 
> > And you'll still have to do that when reading other people's REs. 
> > 
> >  > > Are those groups capturing in Verbal Expressions?  The use of 
> >  > > "find" (~ "search") rather than "match" is disconcerting to the 
> >  > > experienced user. 
> >  > 
> >  > You can alternately use the word "then". The source code is just 
> >  > one python file. It's very easy to read. I actually like "then" 
> >  > over "find" for the example: 
> > 
> > You're missing the point.  The reader does not get to choose the 
> > notation, the author does.  I do understand what several varieties of 
> > RE mean, but the variations are of two kinds: basic versus extended 
> > (ie, what tokens need to be escaped to be taken literally, which ones 
> > have special meaning if escaped), and extensions (which can be 
> > ignored).  Modern RE facilities are essentially all of the extended 
> > variety.  Once you've learned that, you're in good shape for almost 
> > any RE that should be written outside of an obfuscated code contest. 
> > 
> > This is a fundamental principle of Python design: don't make readers 
> > of code learn new things.  That includes using notation developed 
> > elsewhere in many cases. 
> > 
> >  > What does alternation look like? 
> >  > 
> >  > .OR(option1).OR(option2).OR(option3)... 
> >  > 
> >  > > How about alternation of non-trivial regular expressions?
> >  > 
> >  > .OR(other_verbal_expression) 
> > 
> > Real examples, rather than pseudo code, would be nice.  I think you, 
> > too, will find that examples of even fairly simple nested alternations 
> > containing other constructs become quite hard to read, as they fall 
> > off the bottom of the screen. 
> > 
> > For example, the VE equivalent of 
> > 
> >     scheme = "(https?|ftp|file):" 
> > 
> > would be (AFAICT): 
> > 
> >     scheme = (VerEx().then(VerEx().then("http")
> >                                   .maybe("s")
> >                                   .OR("ftp")
> >                                   .OR("file"))
> >                      .then(":"))
> > 
> > which is pretty hideous, I think.  And the colon is captured by a 
> > group.  If perversely I wanted to extract that group from a match, 
> > what would its index be? 
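
For hand-written REs the index question has a mechanical answer: groups
are numbered by the position of their opening parenthesis, left to right.
The sketch below uses only the stdlib re module and says nothing about how
VerEx numbers its groups; the patterns `nested` and `named` are
illustrative variants, not taken from the thread.

```python
import re

# The RE from the message: one capturing group, colon outside it.
scheme = re.compile(r"(https?|ftp|file):")
m = scheme.match("https://example.org/")
whole, first = m.group(0), m.group(1)
print(whole, first)  # https: https

# Illustrative variant: nested groups are numbered by opening parenthesis.
nested = re.compile(r"((https?|ftp|file)(:))")
nested_groups = nested.match("ftp:").groups()
print(nested_groups)  # ('ftp:', 'ftp', ':')

# Illustrative variant: named groups avoid counting parentheses at all.
named = re.compile(r"(?P<scheme>https?|ftp|file)(?P<sep>:)")
named_groups = named.match("file:").groupdict()
print(named_groups)  # {'scheme': 'file', 'sep': ':'}
```
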
> > 
> > I guess you could keep the linear arrangement with 
> > 
> >     scheme = (VerEx().add("(") 
> >                      .then("http") 
> >                      .maybe("s") 
> >                      .OR("ftp") 
> >                      .OR("file") 
> >                      .add(")") 
> >                      .then(":")) 
> > 
> > but is that really an improvement over 
> > 
> >     scheme = VerEx().add("(https?|ftp|file):") 
> > 
> > ;-) 
> > 
> >  > > As far as I can see, Verbal Expressions are basically a way of 
> >  > > making it so painful to write regular expressions that people 
> >  > > will restrict themselves to regular expressions 
> >  > 
> >  > What's so painful to write about them? 
> > 
> > One thing that's painful is that VEs "look like" context-free 
> > grammars, but clumsy and without the powerful semantics.  You can get 
> > the readability you want with greater power using grammars, which is 
> > why I would prefer we work on getting a parser module into the stdlib. 
> > 
> > But if one doesn't know about grammars, it's still not great.  The 
> > main pains about writing VEs for me are (1) reading what I just wrote, 
> > (2) accessing capturing groups, and (3) verbosity.  Even a VE to 
> > accurately match what is normally a fairly short string, such as the 
> > scheme, credentials, authority, and port portions of a "standard" URL, 
> > is going to be hundreds of characters long and likely dozens of lines 
> > if folded as in the examples. 
> > 
> > Another issue is that we already have a perfectly good poor man's 
> > matching library: glob.  The URL example becomes 
> > 
> >     http{,s}://{,www.}* 
> > 
> > Granted you lose the anchors, but how often does that matter?  You 
> > apparently don't use them often enough to remember them. 
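
A caveat for anyone trying the glob idea from Python: the stdlib fnmatch
module understands `*`, `?` and `[...]` but not the shell's `{,s}` brace
alternation, so the sketch below expands the braces by hand. The helper
name `looks_like_url` is hypothetical.

```python
from fnmatch import fnmatchcase
from itertools import product

# Hand-expanded form of http{,s}://{,www.}* because fnmatch has no brace syntax.
PATTERNS = [f"http{s}://{w}*" for s, w in product(("", "s"), ("", "www."))]

def looks_like_url(text):
    """True if text matches any of the expanded glob patterns."""
    return any(fnmatchcase(text, pattern) for pattern in PATTERNS)

print(PATTERNS)
print(looks_like_url("https://www.python.org/"))
print(looks_like_url("ftp://example.org"))
```

Note that fnmatchcase always matches the whole string, and that `*` also
matches spaces, so this is looser than the `[^ ]*` in the RE versions.
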
> > 
> >  > Does your IDE not have autocompletion? 
> > 
> > I don't want an IDE.  I have Emacs. 
> > 
> >  > I find REs so painful to write that I usually just use string 
> >  > methods if at all feasible. 
> > 
> > Guess what?  That's the right thing to do anyway.  They're a lot more 
> > readable and efficient when partitioning a string into two or three 
> > parts, or recognizing a short list of affixes.  But chaining many 
> > methods, as VEs do, is not a very Pythonic way to write a program. 
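
To make the string-methods point concrete, here is a small sketch of
partitioning a string and recognizing a short list of affixes without any
regular expression. The sample URL and variable names are illustrative.

```python
url = "https://www.python.org/dev/peps/"

# Split into at most three parts around the first "://".
scheme, sep, rest = url.partition("://")

# Recognize a short list of prefixes: startswith accepts a tuple.
is_web = url.startswith(("http://", "https://"))

# Strip an optional "www." prefix without a regular expression.
host = rest.partition("/")[0]
bare_host = host[4:] if host.startswith("www.") else host

print(scheme, is_web, bare_host)  # https True python.org
```
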
> > 
> >  > > I don't think that this failure to respect the developer's taste 
> >  > > is restricted to this particular implementation, either. 
> >  > 
> >  > I generally find it distasteful to write a pseudolanguage in 
> >  > strings inside of other languages (this applies to SQL as well). 
> > 
> > You mean like arithmetic operators?  (Lisp does this right, right? 
> > Only one kind of expression, the function call!)  It's a matter of 
> > what you're used to.  I understand that people new to text-processing, 
> > or who don't do so much of it, don't find REs easy to read.  So how is 
> > this a huge loss?  They don't use regular expressions very often!  In 
> > fact, they're far more likely to encounter, and possibly need to 
> > understand, REs written by others! 
> > 
> >  > Especially when the design principles of that pseudolanguage are
> >  > *diametrically opposed* to the design principles of the host
> >  > language. A key principle of Python's design is: "you read code a
> >  > lot more often than you write code, so emphasize 
> >  > readability". Regex seems to be based on: "Do the most with the 
> >  > fewest key-strokes. 
> > 
> > So is all of mathematics.  There's nothing wrong with concise 
> > expression for use in special cases. 
> > 
> >  > Readability be damned!". It makes a lot more sense to wrap the
> >  > pseudolanguage in constructs that bring it in-line with the host
> >  > language than to take on the mental burden of trying to comprehend
> >  > two different languages at the same time.
> >  >
> >  > If you disagree, nothing's stopping you from continuing to write
> >  > REs the old-fashioned way.
> > 
> > I don't think that RE and SQL are "pseudo" languages, no.  And I, and 
> > most developers, will continue to write regular expressions using the 
> > much more compact and expressive RE notation.  (In fact with the 
> > exception of the "word" method, in VEs you still need to use RE notation
> > to express most of the Python extensions.)  So what you're saying is 
> > that you don't read much code, except maybe your own.  Isn't that your 
> > problem?  Those of us who cooperate widely on applications using 
> > regular expressions will continue to communicate using REs.  If that 
> > leaves you out, that's not good.  But adding VEs to the stdlib (and 
> > thus encouraging their use) will split the community into RE users and 
> > VE users, if VEs are at all useful.  That's bad.  I don't see the
> > potential usefulness of VEs to infrequent users of regular
> > expressions outweighing the downsides of "many ways to do it" in the
> > stdlib.
> > 
> >  > Can we at least agree that baking special re syntax directly into 
> >  > the language is a bad idea? 
> > 
> > I agree that there's no particular need for RE literals.  If one wants 
> > to mark an RE as some special kind of object, re.compile() does that 
> > very well both by converting to a different type internally and as a 
> > marker syntactically. 
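
A minimal sketch of the re.compile() point, using the scheme pattern from
earlier in the thread: the call is visible in the source as a marker, and
the result is a pattern object rather than a str. The name `SCHEME_RE` and
the sample input are illustrative.

```python
import re

# Compiling marks the string as a pattern and yields a distinct object type.
SCHEME_RE = re.compile(r"(https?|ftp|file):")

print(type(SCHEME_RE).__name__)
print(SCHEME_RE.pattern)  # the original source string is kept

match = SCHEME_RE.match("file:///etc/hosts")
print(match.group(1))  # file
```
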
> > 
> >  > On Wed, Mar 29, 2017 at 11:49 PM, Nick Coghlan <ncog... at gmail.com> wrote:
> >  > 
> >  > > We don't really want to ease the use of regexps in Python - while 
> >  > > they're an incredibly useful tool in a programmer's toolkit, 
> >  > > they're so cryptic that they're almost inevitably a 
> >  > > maintainability nightmare. 
> > 
> > I agree with Nick.  Regular expressions, whatever the notation, are a 
> > useful tool (no suspension of disbelief necessary for me, though!). 
> > But they are cryptic, and it's not just the notation.  People (even 
> > experienced RE users) are often surprised by what fairly simple
> > regular expressions match in a given text, because people want to read
> > a regexp as instructions to a one-pass greedy parser, and it isn't. 
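
Three small cases that tend to surprise readers who expect a one-pass
greedy parser, sketched with the stdlib re module. The patterns and inputs
are made up for illustration and are not from the thread.

```python
import re

# Greedy, but with backtracking: a+ first grabs "aaa", then gives one "a"
# back so that the literal "ab" can still match.
backtracked = re.match(r"(a+)(ab)", "aaab").groups()
print(backtracked)  # ('aa', 'ab')

# Greedy .* runs to the last ">" in the string, not the first.
greedy = re.search(r"<.*>", "<b>bold</b>").group()
print(greedy)  # <b>bold</b>

# Alternation takes the first alternative that works, not the longest.
first_alt = re.match(r"http|https", "https://x").group()
print(first_alt)  # http
```
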
> > 
> > For example, above I wrote 
> > 
> >     scheme = "(https?|ftp|file):" 
> > 
> > rather than 
> > 
> >     scheme = r"(\w+):"
> > 
> > because it's not unlikely that I would want to treat those differently 
> > from other schemes such as mailto, news, and doi.  In many 
> > applications of regular expressions (such as tokenization for a 
> > parser) you need many expressions.  Compactness really is a virtue in 
> > REs. 
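
A rough sketch of what "many expressions" looks like in a tokenizer: each
token kind gets one short RE, and they are joined into a single alternation
of named groups. The token names and the sample input are hypothetical;
only the distinction between the listed schemes and other schemes comes
from the message.

```python
import re

# One compact RE per token kind; order matters because the first
# alternative that matches wins.
TOKEN_SPEC = [
    ("SCHEME", r"(?:https?|ftp|file):"),  # schemes treated specially
    ("OTHER_SCHEME", r"\w+:"),            # mailto:, news:, doi:, ...
    ("SLASHES", r"//"),
    ("TEXT", r"[^\s:/]+"),
    ("SEP", r"[:/]"),
    ("SPACE", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, value) pairs, skipping whitespace."""
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "SPACE":
            yield m.lastgroup, m.group()

tokens = list(tokenize("https://www.python.org mailto:me"))
print(tokens)
```

With six token kinds the whole specification still fits on one screen,
which is the compactness argument in practice.
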
> > 
> > Steve 
> > 
> > _______________________________________________ 
> > Python-ideas mailing list 
> > Python... at python.org
> > https://mail.python.org/mailman/listinfo/python-ideas 
> > Code of Conduct: http://python.org/psf/codeofconduct/ 
>