[ python-Bugs-1202493 ] RE parser too loose with {m,n} construct

Thu Jun 2 13:16:01 CEST 2005

Bugs item #1202493, was opened at 2005-05-15 16:59
Message generated for change (Comment added) made by montanaro
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1202493&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Regular Expressions
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Submitted By: Skip Montanaro (montanaro)
Assigned to: Gustavo Niemeyer (niemeyer)
Summary: RE parser too loose with {m,n} construct

Initial Comment:
This seems wrong to me:

>>> re.match("(UNIX{})", "UNIX{}").groups()
('UNIX',)

With no numbers or commas, "{}" should not be considered
special in the pattern.  The docs identify three numeric
repetition possibilities: {m}, {m,} and {m,n}.  There's no
description of {} meaning anything.  Either the docs should
say {} implies {1,1}, {} should have no special meaning, or
an exception should be raised during compilation of the
regular expression.
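
For reference, the three documented repetition forms can be exercised as below. This is only a sketch and not part of the original report; the helper name `full` is made up for illustration. The result for the bare "{}" pattern is printed rather than stated, since it depends on the Python version in use.

```python
import re

def full(pattern, text):
    # Hypothetical helper: True if the pattern matches the entire string.
    return re.fullmatch(pattern, text) is not None

# {m}: exactly m repetitions
print(full(r"a{2}", "aa"), full(r"a{2}", "aaa"))
# {m,}: m or more repetitions
print(full(r"a{2,}", "aaaaa"), full(r"a{2,}", "a"))
# {m,n}: between m and n repetitions
print(full(r"a{1,3}", "aa"), full(r"a{1,3}", "aaaa"))

# The pattern from the report; what this prints varies by Python version.
print(re.match("(UNIX{})", "UNIX{}").groups())
```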

----------------------------------------------------------------------

>Comment By: Skip Montanaro (montanaro)
Date: 2005-06-02 06:16

Message:
Logged In: YES 
user_id=44345

In the absence of strong technical reasons, I'd vote to do what Perl
does.  I believe the assumption all along has been that most people 
coming to Python who already know how to use regular expressions are 
Perl programmers.  It wouldn't seem to make sense to throw little land
mines in their paths.  I realize that explicit is better than implicit, but
practicality beats purity.

----------------------------------------------------------------------

Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-06-01 16:32

Message:
Logged In: YES 
user_id=1188172

Okay. Attaching patch which does that.

BTW, these things are currently allowed too (treated as
literals):

"{"
"{x"
"{x}"
"{x,y}"
"{1,x}"
etc.

The patch changes that, too.
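
One way to check how a given interpreter treats the fragments listed above is sketched here. The helper `classify` is hypothetical and not part of the patch; it only reports what the installed re module does, which may differ between Python releases.

```python
import re

def classify(fragment):
    # Hypothetical helper: append the fragment to "a" and report how the
    # re module of this interpreter handles the resulting pattern.
    pattern = "a" + fragment
    try:
        compiled = re.compile(pattern)
    except re.error:
        return "error"
    # If the pattern matches its own text, the braces were read literally.
    return "literal" if compiled.fullmatch(pattern) else "special"

for fragment in ["{", "{x", "{x}", "{x,y}", "{1,x}", "{2}"]:
    print(repr(fragment), classify(fragment))
```

A well-formed "{2}" is included as a control: it should be reported as "special", because "a{2}" matches "aa" and not its own text.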

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2005-06-01 16:07

Message:
Logged In: YES 
user_id=80475

I prefer Skip's third option, raising an exception during
compilation.  This is an re syntax error.  Treat it the same
way that we handle similar situations with regular Python:

>>> a[]
SyntaxError: invalid syntax
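
The re module already reports some malformed patterns at compile time by raising re.error, which is the mechanism this option would extend. A minimal sketch follows; the helper name `compiles` is an assumption for illustration, and the example patterns are ones that re rejects today, not the "{}" case under discussion.

```python
import re

def compiles(pattern):
    # Hypothetical helper: True if re accepts the pattern,
    # False if compilation raises re.error.
    try:
        re.compile(pattern)
    except re.error as exc:
        print("rejected %r: %s" % (pattern, exc))
        return False
    return True

compiles("a{2,1}")   # minimum repeat greater than maximum
compiles("(abc")     # unbalanced parenthesis
compiles("a{1,2}")   # well-formed, accepted
```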

----------------------------------------------------------------------

Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-06-01 15:30

Message:
Logged In: YES 
user_id=1188172

So, should a {} raise an error, or warn as Ruby does?

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2005-06-01 15:25

Message:
Logged In: YES 
user_id=80475

IMO, the simplest rule is that braces always be considered
special.  This accommodates future extensions, simplifies
the re compiler, and makes it easier to know what needs to
be escaped.
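
Whatever rule is chosen, escaping the braces removes the ambiguity. A short sketch, assuming the goal is to match the literal text "UNIX{}" from the initial report; the variable names are illustrative only.

```python
import re

text = "UNIX{}"

# Let the library decide what needs escaping.
escaped = re.escape(text)
print(escaped)
print(re.match("(" + escaped + ")", text).groups())

# Or escape the braces by hand in a raw string.
explicit = r"UNIX\{\}"
print(re.match("(" + explicit + ")", text).groups())
```

Both patterns capture the full string, braces included, regardless of how an unescaped "{}" is interpreted.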

----------------------------------------------------------------------

Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-06-01 11:54

Message:
Logged In: YES 
user_id=1188172

It's interesting what other RE implementations do with this
ambiguity:
Perl treats {} as literal in REs, as Skip proposes.
Ruby does, too, but issues a warning about } being unescaped.
GNU (e)grep v2.5.1 allows a bare {} only if it is at the
start of a RE, and then matches it literally.
GNU sed v4.1.4 never allows it.
GNU awk v3.1.4 is gracious and acts like Perl.

Attached is a patch that fixes this behaviour in what
appears to be the "common sense" way.

----------------------------------------------------------------------
