PyWart: Python regular expression syntax is not intuitive.

Ian Kelly ian.g.kelly at gmail.com
Wed Jan 25 15:17:11 EST 2012


On Wed, Jan 25, 2012 at 10:16 AM, Rick Johnson
<rantingrickjohnson at gmail.com> wrote:
> (?...)  # Base Extension Syntax
> All extensions are wrapped in parentheses and start with a question
> mark, but i believe the question mark was a very bad choice, since the
> question mark is already specific to "zero or one repetitions of
> preceding RE". This simple error is why i believe Python re's are so
> damn difficult to eyeball parse. You'll constantly be forced to spend
> too much time deciding if this question mark is referring to
> repeats, or is the start of an extension syntax. We should have
> chosen another char, and the char should NOT be known to RE in any
> other place. Maybe the tilde would work? Wait, i have a MUCH better
> idea!!!

Did you read the very first sentence of the re module documentation?
"This module provides regular expression matching operations *similar
to those found in Perl*" (my emphasis).  The goal here is
compatibility with existing RE syntaxes, not readability.  Perl uses
the (?...) syntax, so the re module does too.

> (?iLmsux) # Passing Flags Internally
> This is ridiculous. re's are cryptic enough without inviting TIMTOWTDI
> over to play. Passing flags this way does nothing BUT harm
> readability. Please people, pass your flags as an argument to the
> appropriate re.method() and NOT as another cryptic syntax.

1) Not all regular expressions are hard-coded.  Some applications even
allow users to supply regular expressions as data.  Permitting flags
in the regular expression allows the user to specify or override the
defaults set by the application.

2) Permitting flags in the regular expression allows different
combinations of flags to be in effect for different parts of complex
regular expressions.  You can't do that just by passing in the flags
as an argument.
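
A minimal sketch of both points in today's Python. The helper name
find_all and the sample strings are made up for illustration. Note
that the scoped form (?i:...) needs Python 3.6 or later; as far as I
know, older versions of the re module applied an inline flag to the
whole pattern, so point 2 described Perl more than Python back then.

```python
import re

def find_all(user_pattern, text, default_flags=0):
    # Hypothetical application helper: the pattern arrives as data,
    # and the application only controls the default flags.
    return re.findall(user_pattern, text, default_flags)

text = "Spam spam SPAM"

# Point 1: the user overrides the case-sensitive default from
# inside the pattern itself.
default_result = find_all(r"spam", text)
override_result = find_all(r"(?i)spam", text)

# Point 2: a flag scoped to one group (Python 3.6+ syntax).
# Only "hello" is case-insensitive; " World" must match exactly.
scoped_result = re.findall(r"(?i:hello) World", "HELLO World, hello world")

print(default_result)
print(override_result)
print(scoped_result)
```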

> (?:...) # Non-Capturing Group
> When i look at this pattern "non-capturing" DOES NOT scream out at me,
> and again, the question mark is used incorrectly. When i think of a
> char that screams NEGATIVE, i think of the exclamation mark, NOT the
> question mark. And how the HELL is the colon helping me to interpret
> this syntax?

Don't ask us.  Ask Larry Wall.
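
For anyone following along, here is a small sketch of what the
non-capturing form actually changes. The date string is just sample
data: (?:...) groups like ordinary parentheses but is left out of the
match's groups.

```python
import re

date = "2012-01-25"

# Three ordinary groups: all three are captured.
capturing = re.match(r"(\d{4})-(\d{2})-(\d{2})", date).groups()

# The month is grouped with (?:...), so it is matched but not captured.
non_capturing = re.match(r"(\d{4})-(?:\d{2})-(\d{2})", date).groups()

print(capturing)
print(non_capturing)
```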

> (?=...)  # positive look ahead
> (?!...)  # negative look ahead
> (?<=...) # positive look behind
> (?<!...) # negative look behind
>
> I cannot decipher these patterns in their current syntactical forms.
> Too much information is missing or misleading. I have no idea which
> pattern is looking forward, which pattern is looking backward, which
> pattern is negative, and which pattern is positive. I need syntactical
> clues! Consider these:
>
> (?>=...) #Read as "forward equals pattern?"
> (?>!=...) #Read as "forward NOT equals pattern?"
> (?<=...) #Read as "backwards equals pattern?"
> (?<!=...) #Read as "backwards NOT equals pattern?"
>
> However, i really don't like the fact that negative assertions need
> one more char than positive assertions. Here is an alternative:
>
> (?>+...) #Read as "forward equals pattern?"
> (?>-...) #Read as "forward NOT equals pattern?"
> (?<+...) #Read as "backwards equals pattern?"
> (?<-...) #Read as "backwards NOT equals pattern?"
>
> Looks much better HOWEVER we still have too much useless noise.
> Replace the parenthesis delimiters with braces, and drop the "where's
> waldo" question mark,  and we have a simplistically intuitive
> syntactical bliss!

Once again, these come from Perl.  Note also that Perl already has
(?>...) which means something entirely different.
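
A short sketch of the four existing assertions, using made-up sample
strings. Each one tests the text next to the match without consuming
it. For reference, (?>...) in Perl is an atomic (non-backtracking)
group.

```python
import re

# (?=...) positive lookahead: "foo" only if "bar" follows.
ahead_pos = re.search(r"foo(?=bar)", "foobar").group()

# (?!...) negative lookahead: "foo" only if "bar" does NOT follow.
ahead_neg_hit = re.search(r"foo(?!bar)", "foobaz").group()
ahead_neg_miss = re.search(r"foo(?!bar)", "foobar")

# (?<=...) positive lookbehind: "bar" only if "foo" precedes.
behind_pos = re.search(r"(?<=foo)bar", "foobar").group()

# (?<!...) negative lookbehind: "bar" only if "foo" does NOT precede.
behind_neg_hit = re.search(r"(?<!foo)bar", "bazbar").group()
behind_neg_miss = re.search(r"(?<!foo)bar", "foobar")

print(ahead_pos, ahead_neg_hit, ahead_neg_miss)
print(behind_pos, behind_neg_hit, behind_neg_miss)
```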

> {...}  # Base Extension Syntax
> {iLmsux}  # Passing Flags Internally
> {!()...} or (!...) # Non Capturing.
> {NG=identifier...}  # Named Group Capture
> {NG.name}  # Named Group Reference
> {#...}  # Comment
> {>+...}  # Positive Look Ahead Assertion
> {>-...}  # Negative Look Ahead Assertion
> {<+...}  # Positive Look Behind Assertion
> {<-...}  # Negative Look Behind Assertion
> {(id/name)yes-pat|no-pat}
>
> *school-bell-rings*

Regular expression reform is not necessarily a bad thing, but this is
just forcing everybody to learn Yet Another Regex Syntax for no real
purpose.  All that you've changed here is window dressing.  For an
overview of many of the *real* problems with regular expression
syntax, see

http://www.perl.com/pub/2002/06/04/apo5.html

Ian


