<div dir="ltr">Same.  One day, Python will have a decent parsing library.<br><br>On Friday, March 31, 2017 at 4:21:51 AM UTC-4, Stephan Houben wrote:<blockquote class="gmail_quote" style="margin: 0;margin-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;">Hi all,

<br>

<br>FWIW, I also strongly prefer the Verbal Expression style and consider

<br>"normal" regular expressions to become quickly unreadable and

unmaintainable.

<br>

Verbal Expressions are also much more composable.

<br>

<br>Stephan

<br>

<br>2017-03-31 9:23 GMT+02:00 Stephen J. Turnbull

<br>&lt;turnbull....@u.tsukuba.ac.jp&gt;:

<br>> Abe Dillon writes:

<br>>

<br>>  > Note that the entire documentation is 250 words while just the syntax

>  > portion of Python docs for the re module is over 3000 words.

<br>>

<br>> Since Verbal Expressions (below, VEs, indicating notation) "compile"

<br>> to regular expressions (spelling out indicates the internal matching

<br>> implementation), the documentation of VEs presumably ignores

<br>> everything except the limited language it's useful for.  To actually

> understand VEs, you need to refer to the RE docs.  Not a win IMO.

<br>>

<br>>  > > You think that example is more readable than the proposed translation

<br>>  > >     ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$

<br>>  > > which is better written

<br>>  > >     ^https?://(www\.)?[^ ]*$

<br>>  > > or even

<br>>  > >     ^https?://[^ ]*$

<br>>  >

<br>>  >

<br>>  > Yes. I find it *far* more readable. It's not a soup of symbols like Perl

<br>>  > code. I can only surmise that you're fluent in regex because it seems

<br>>  > difficult for you to see how the above could be less readable than English

<br>>  > words.

<br>>

> Yes, I'm fairly fluent in regular expression notation (below, REs).

> I've maintained a compiler for one dialect.

<br>>

<br>> I'm not interested in the difference between words and punctuation

<br>> though.  The reason I find the middle RE most readable is that it

<br>> "looks like" what it's supposed to match, in a contiguous string as

<br>> the object it will match will be contiguous.  If I need to parse it to

> figure out *exactly* what it matches, yes, that takes more effort.

<br>> But to understand a VE's semantics correctly, I'd have to look it up

<br>> as often as you have to look up REs because many words chosen to notate

<br>> VEs have English meanings that are (a) ambiguous, as in all natural

> language, and (b) only approximate matches to RE semantics.

<br>>
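<br>A minimal sketch, using Python's <code>re</code> module, of how the three spellings quoted above compare. The sample strings are assumptions chosen for illustration; the point is only that the patterns accept the same inputs and differ in how many groups they capture.
<pre>
```python
import re

# The three spellings of the URL pattern quoted in the thread.
PATTERNS = {
    "translated": r"^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$",
    "middle": r"^https?://(www\.)?[^ ]*$",
    "short": r"^https?://[^ ]*$",
}

# Hypothetical sample inputs, chosen only for illustration.
SAMPLES = [
    "http://example.com",
    "https://www.example.com/path",
    "ftp://example.com",
    "https://example.com/has space",
]


def accepted(pattern, text):
    """Return True if the pattern matches the text."""
    return re.match(pattern, text) is not None


# One row of three booleans per sample, in PATTERNS order.
RESULTS = {
    text: [accepted(p, text) for p in PATTERNS.values()] for text in SAMPLES
}

# Number of capturing groups each spelling creates.
GROUP_COUNTS = {name: re.compile(p).groups for name, p in PATTERNS.items()}

for text, row in RESULTS.items():
    print(text, row)
print(GROUP_COUNTS)
```
</pre>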

<br>>  > I could tell it only matches URLs that are the only thing inside

<br>>  > the string because it clearly says: start_of_line() and

<br>>  > end_of_line().

<br>>

<br>> That's not the problem.  The problem is the semantics of the method

<br>> "find".  "then" would indeed read better, although it doesn't exactly

> match the semantics of concatenation in REs.

<br>>

<br>>  > I would have had to refer to a reference to know that "^" doesn't

<br>>  > always mean "not", it sometimes means "start of string" and

<br>>  > probably other things. I would also have to check a reference to

>  > know that "$" can mean "end of string" (and probably other things).

<br>>

> And you'll still have to do that when reading other people's REs.

<br>>

<br>>  > > Are those groups capturing in Verbal Expressions?  The use of

<br>>  > > "find" (~ "search") rather than "match" is disconcerting to the

<br>>  > > experienced user.

<br>>  >

<br>>  > You can alternately use the word "then". The source code is just

<br>>  > one python file. It's very easy to read. I actually like "then"

<br>>  > over "find" for the example:

<br>>

<br>> You're missing the point.  The reader does not get to choose the

<br>> notation, the author does.  I do understand what several varieties of

<br>> RE mean, but the variations are of two kinds: basic versus extended

<br>> (ie, what tokens need to be escaped to be taken literally, which ones

<br>> have special meaning if escaped), and extensions (which can be

<br>> ignored).  Modern RE facilities are essentially all of the extended

<br>> variety.  Once you've learned that, you're in good shape for almost

> any RE that should be written outside of an obfuscated code contest.

<br>>

<br>> This is a fundamental principle of Python design: don't make readers

<br>> of code learn new things.  That includes using notation developed

> elsewhere in many cases.

<br>>

>  > What does alternation look like?

<br>>  >

<br>>  > .OR(option1).OR(option2).OR(option3)...

<br>>  >

<br>>  > How about alternation of

<br>>  > > non-trivial regular expressions?

<br>>  >

<br>>  > .OR(other_verbal_expression)

<br>>

<br>> Real examples, rather than pseudo code, would be nice.  I think you,

<br>> too, will find that examples of even fairly simple nested alternations

<br>> containing other constructs become quite hard to read, as they fall

> off the bottom of the screen.

<br>>

<br>> For example, the VE equivalent of

<br>>

<br>>     scheme = "(https?|ftp|file):"

<br>>

<br>> would be (AFAICT):

<br>>

<br>>     scheme = VerEx().then(VerEx().then("http")

<br>>                                  .maybe("s")

<br>>                                  .OR("ftp")

<br>>                                  .OR("file"))

<br>>                     .then(":")

<br>>

<br>> which is pretty hideous, I think.  And the colon is captured by a

<br>> group.  If perversely I wanted to extract that group from a match,

> what would its index be?

<br>>

<br>> I guess you could keep the linear arrangement with

<br>>

<br>>     scheme = (VerEx().add("(")

<br>>                      .then("http")

<br>>                      .maybe("s")

<br>>                      .OR("ftp")

<br>>                      .OR("file")

<br>>                      .add(")")

<br>>                      .then(":"))

<br>>

<br>> but is that really an improvement over

<br>>

<br>>     scheme = VerEx().add("(https?|ftp|file):")

<br>>

<br>> ;-)

<br>>
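<br>For comparison, a small sketch of how the plain RE exposes its single group through Python's <code>re</code> module. The sample URL is an assumption, and nothing here says how the VE library itself numbers groups.
<pre>
```python
import re

# The RE from the message above; the parentheses form group 1.
scheme = re.compile("(https?|ftp|file):")

match = scheme.match("https://www.python.org")
print(match.group(0))  # whole match, colon included
print(match.group(1))  # the scheme alone, without the colon
print(scheme.groups)   # number of capturing groups in the pattern
```
</pre>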

<br>>  > > As far as I can see, Verbal Expressions are basically a way of

<br>>  > > making it so painful to write regular expressions that people

<br>>  > > will restrict themselves to regular expressions

<br>>  >

>  > What's so painful to write about them?

<br>>

<br>> One thing that's painful is that VEs "look like" context-free

<br>> grammars, but clumsy and without the powerful semantics.  You can get

<br>> the readability you want with greater power using grammars, which is

> why I would prefer we work on getting a parser module into the stdlib.

<br>>

<br>> But if one doesn't know about grammars, it's still not great.  The

<br>> main pains about writing VEs for me are (1) reading what I just wrote,

<br>> (2) accessing capturing groups, and (3) verbosity.  Even a VE to

<br>> accurately match what is normally a fairly short string, such as the

<br>> scheme, credentials, authority, and port portions of a "standard" URL,

<br>> is going to be hundreds of characters long and likely dozens of lines

> if folded as in the examples.

<br>>

<br>> Another issue is that we already have a perfectly good poor man's

<br>> matching library: glob.  The URL example becomes

<br>>

<br>>     http{,s}://{,www.}*

<br>>

<br>> Granted you lose the anchors, but how often does that matter?  You

> apparently don't use them often enough to remember them.

<br>>
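<br>A hedged sketch of the glob idea in Python. Note that the <code>{,s}</code> brace syntax is shell brace expansion, which the stdlib <code>fnmatch</code> module does not perform, so the alternatives are expanded by hand here. The helper name <code>looks_like_http_url</code> is hypothetical.
<pre>
```python
from fnmatch import fnmatchcase

# fnmatch has no {a,b} alternation, so the braces are expanded by hand.
# "{,www.}*" collapses to "*" because "*" already covers "www.".
PATTERNS = ("http://*", "https://*")


def looks_like_http_url(text):
    """Return True if text matches any of the expanded glob patterns."""
    return any(fnmatchcase(text, pattern) for pattern in PATTERNS)


print(looks_like_http_url("https://www.python.org"))
print(looks_like_http_url("ftp://ftp.python.org"))
```
</pre>
<br>Unlike <code>[^ ]*</code> in the RE, the glob <code>*</code> also matches spaces, so this is a looser test.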

>  > Does your IDE not have autocompletion?

<br>>

> I don't want an IDE.  I have Emacs.

<br>>

<br>>  > I find REs so painful to write that I usually just use string

>  > methods if at all feasible.

<br>>

<br>> Guess what?  That's the right thing to do anyway.  They're a lot more

<br>> readable and efficient when partitioning a string into two or three

<br>> parts, or recognizing a short list of affixes.  But chaining many

> methods, as VEs do, is not a very Pythonic way to write a program.

<br>>
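<br>A short illustration of the string-method approach for the two cases mentioned, partitioning and affix recognition. The URL is an assumed sample value.
<pre>
```python
url = "https://www.python.org/dev/peps/"

# Partition into two parts around the first "://".
scheme, separator, rest = url.partition("://")

# Recognize a short list of prefixes; startswith accepts a tuple.
is_web = url.startswith(("http://", "https://"))

# Strip an optional "www." prefix without a regular expression.
host_and_path = rest[4:] if rest.startswith("www.") else rest

print(scheme, is_web, host_and_path)
```
</pre>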

<br>>  > > I don't think that this failure to respect the developer's taste

>  > > is restricted to this particular implementation, either.

<br>>  >

<br>>  > I generally find it distasteful to write a pseudolanguage in

>  > strings inside of other languages (this applies to SQL as well).

<br>>

> You mean like arithmetic operators?  (Lisp does this right, right?

<br>> Only one kind of expression, the function call!)  It's a matter of

<br>> what you're used to.  I understand that people new to text-processing,

<br>> or who don't do so much of it, don't find REs easy to read.  So how is

<br>> this a huge loss?  They don't use regular expressions very often!  In

<br>> fact, they're far more likely to encounter, and possibly need to

> understand, REs written by others!

<br>>

<br>>  > Especially when the design principles of that pseudolanguage are

<br>>  > *diametrically opposed* to the design principles of the host

<br>>  > language. A key principle of Python's design is: "you read code a

<br>>  > lot more often than you write code, so emphasize

<br>>  > readability". Regex seems to be based on: "Do the most with the

<br>>  > fewest key-strokes.

<br>>

<br>> So is all of mathematics.  There's nothing wrong with concise

> expression for use in special cases.

<br>>

<br>>  > Readability be damned!". It makes a lot more sense to wrap the

<br>>  > pseudolanguage in constructs that bring it in-line with the host

<br>>  > language than to take on the mental burden of trying to comprehend

>  > two different languages at the same time.

<br>>  >

<br>>  > If you disagree, nothing's stopping you from continuing to write

<br>>  > REs the old-fashioned way.

<br>>

<br>> I don't think that RE and SQL are "pseudo" languages, no.  And I, and

<br>> most developers, will continue to write regular expressions using the

<br>> much more compact and expressive RE notation.  (In fact with the

<br>> exception of the "word" method, in VEs you still need to use RE notation

<br>> to express most of the Python extensions.)  So what you're saying is

<br>> that you don't read much code, except maybe your own.  Isn't that your

<br>> problem?  Those of us who cooperate widely on applications using

<br>> regular expressions will continue to communicate using REs.  If that

<br>> leaves you out, that's not good.  But adding VEs to the stdlib (and

<br>> thus encouraging their use) will split the community into RE users and

<br>> VE users, if VEs are at all useful.  That's bad.  I don't see

<br>> the potential usefulness of VEs to infrequent users of regular

<br>> expressions outweighing the downsides of "many ways to do it" in the

<br>> stdlib.

<br>>

<br>>  > Can we at least agree that baking special re syntax directly into

>  > the language is a bad idea?

<br>>

<br>> I agree that there's no particular need for RE literals.  If one wants

<br>> to mark an RE as some special kind of object, re.compile() does that

<br>> very well both by converting to a different type internally and as a

<br>> marker syntactically.

<br>>
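<br>A brief sketch of that point: compiling yields a distinct pattern object, and the <code>re.compile()</code> call itself tells the reader the string is an RE. The constant name <code>SCHEME</code> and the sample input are assumptions.
<pre>
```python
import re

# Compiling turns the string into a pattern object with its own type
# and methods, which also marks it as an RE for the reader.
SCHEME = re.compile(r"(https?|ftp|file):")

print(type(SCHEME).__name__)
print(SCHEME.pattern)
print(SCHEME.match("file:///etc/hosts").group(1))
```
</pre>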

<br>>  > On Wed, Mar 29, 2017 at 11:49 PM, Nick Coghlan &lt;ncog...@gmail.com&gt; wrote:

<br>>  >

<br>>  > > We don't really want to ease the use of regexps in Python - while

<br>>  > > they're an incredibly useful tool in a programmer's toolkit,

<br>>  > > they're so cryptic that they're almost inevitably a

<br>>  > > maintainability nightmare.

<br>>

<br>> I agree with Nick.  Regular expressions, whatever the notation, are a

> useful tool (no suspension of disbelief necessary for me, though!).

<br>> But they are cryptic, and it's not just the notation.  People (even

<br>> experienced RE users) are often surprised by what fairly simple

<br>> regular expressions match in a given text, because people want to read

> a regexp as instructions to a one-pass greedy parser, and it isn't.

<br>>
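<br>Two small cases, sketched with Python's <code>re</code> module, where reading a pattern as one-pass greedy instructions gives the wrong answer. The inputs are made up for illustration.
<pre>
```python
import re

# Greedy "a+" first takes all three a's, then gives one back so that
# "ab" can still match: the engine backtracks, it is not one-pass.
m = re.match(r"(a+)(ab)", "aaab")
print(m.groups())

# Alternation tries branches left to right and keeps the first that
# works, not the longest one.
first = re.match(r"http|https", "https://example.com").group()
longest = re.match(r"https|http", "https://example.com").group()
print(first, longest)
```
</pre>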

<br>> For example, above I wrote

<br>>

<br>>     scheme = "(https?|ftp|file):"

<br>>

<br>> rather than

<br>>

<br>>     scheme = "(\w+):"

<br>>

<br>> because it's not unlikely that I would want to treat those differently

<br>> from other schemes such as mailto, news, and doi.  In many

<br>> applications of regular expressions (such as tokenization for a

<br>> parser) you need many expressions.  Compactness really is a virtue in

<br>> REs.

<br>>
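<br>To make the tokenization remark concrete, here is a toy sketch in which several short REs are tried in order at each position. The token names, the token set, and the <code>tokenize</code> helper are all hypothetical and far simpler than a real URL grammar.
<pre>
```python
import re

# Hypothetical token set for a toy URL tokenizer; names are illustrative.
TOKEN_SPEC = [
    ("SCHEME", re.compile(r"(?:https?|ftp|file):")),
    ("SLASHES", re.compile(r"//")),
    ("NAME", re.compile(r"[\w.-]+")),
    ("SLASH", re.compile(r"/")),
]


def tokenize(text):
    """Split text into (kind, value) pairs, trying each RE in order."""
    tokens = []
    pos = 0
    while pos != len(text):
        for kind, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            # No pattern matched at this position.
            raise ValueError("unexpected character at position %d" % pos)
    return tokens


print(tokenize("https://www.python.org/dev"))
```
</pre>
<br>With four token kinds the table still fits on a few lines, which is the compactness being argued for.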

<br>> Steve

<br>>

<br>> _______________________________________________

<br>> Python-ideas mailing list

<br>> Python...@python.org

<br>> <a href="https://mail.python.org/mailman/listinfo/python-ideas" target="_blank" rel="nofollow">https://mail.python.org/mailman/listinfo/python-ideas</a>

<br>> Code of Conduct: <a href="http://python.org/psf/codeofconduct/" target="_blank" rel="nofollow">http://python.org/psf/codeofconduct/</a>


<br></blockquote></div>