[Python-bugs-list] [ python-Bugs-476912 ] regex annoyance

noreply@sourceforge.net noreply@sourceforge.net
Wed, 31 Oct 2001 22:45:56 -0800


Bugs item #476912, was opened at 2001-10-31 12:17
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=476912&group_id=5470

Category: Regular Expressions
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Bill Bumgarner (bbum)
Assigned to: Fredrik Lundh (effbot)
Summary: regex annoyance

Initial Comment:
(this may be a feature request-- but it is annoying 
enough that I filed it as a bug)

Python's named subexpressions within regular 
expressions are an incredibly valuable feature. 
Combined with the ability to write multiline regexes 
with comments that are collapsed automatically, they 
lead to very readable regexes.

However, there is an annoyance in named 
subexpressions that has bitten me several times.

Namely, suppose a particular token must be parsed out 
of the input by one of two (or more) alternative 
expressions, and the pattern cannot be written without 
giving the same named subexpression more than one 
place to match. The named subexpression will then be 
non-None only intermittently (depending on expression 
order and what was matched).

That is, given:

(?:(?P<Tok1>[a-z]+)\s(?P<Tok2>[a-z]+))|(?:(?P<Tok1>[a-z]+)\t(?P<Tok2>[a-z]+))

In this case, Tok1 and Tok2 will be None if the first 
expression matches... 

(Yes, this is a contrived example that could be 
refactored to not use multiple <Tok1>/<Tok2> 
references-- however, more complex expressions 
do not always enable easy refactoring.)

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2001-10-31 22:45

Message:
Logged In: YES 
user_id=31435

Bill, you misunderstand my comment:  I'm not trying to 
solve your problem <wink>.  Named groups were my idea to 
begin with (years ago), and what you want of them is both 
unclear and beyond their intended use.

I'm not suggesting to take *away* "support for multiple 
subexpressions with the same name":  there is no such 
support, only the illusion of support due to the regexp 
compiler failing to raise an exception when a name is 
redefined (that's an old bug, btw:  it's persisted across 
three generations of underlying regexp engine).

Group names are nothing but synonyms for numbered groups; 
they add no power, just convenience.  If you want more than 
that, that's fine, but then you need to specify exactly 
what happens in all cases, and get that implemented.  The 
semantics of named groups right now are defined in terms of 
a trivial bijection with numbered groups, and all you're 
seeing when you repeat a name is implementation accidents 
due to a failure to enforce that there *is* a bijection.
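
To make the "synonyms for numbered groups" point concrete, here is a minimal sketch. The pattern and the group names `first` and `second` are made up for illustration and do not come from the report. It only shows that a compiled pattern exposes a name-to-number mapping, and that looking a group up by name or by number gives the same result.

```python
import re

# Hypothetical pattern for illustration; these group names are not from the report.
pattern = re.compile(r"(?P<first>[a-z]+)\s(?P<second>[a-z]+)")

# groupindex maps each symbolic group name to its integer group number.
name_to_number = dict(pattern.groupindex)

match = pattern.match("hello world")

# Looking a group up by name or by number returns the same substring.
same_first = match.group("first") == match.group(1)
same_second = match.group("second") == match.group(2)
```

With each name defined exactly once, the mapping is one-to-one, which is the bijection the comment above refers to.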

----------------------------------------------------------------------

Comment By: Bill Bumgarner (bbum)
Date: 2001-10-31 18:48

Message:
Logged In: YES 
user_id=103811

While I agree that the proposed solution of raising an exception would certainly be more acceptable behavior than what is occurring now, doing away with support for multiple subexpressions with the same name would be undesirable.

In particular, named subexpressions free the developer from counting groups. They also spare the developer from writing a few lines of if/else statements to fetch a value that might be in either expression A or expression B.

I would rather an error be raised if two separate instances of named expression A were both defined.   As long as only one matches, then it shouldn't matter that it appears twice.

The goal is to be able to do this|that where this and that both define the same set of named subexpressions.  By definition, only one of this or that will match and, therefore, only one value could be had for a named expression that appears in both this and that.

(As it stands, I have numerous lines of if/else 'this or that' code that generally causes clutter.  It means that the groupdict() cannot be treated as a pure result-- I often have to go through the this/that logic to normalize the groupdict into something that actually represents the results I desired).
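
The "this or that" normalization described above could be folded into one helper, roughly as sketched below. This is only an illustration under stated assumptions: the `_a`/`_b` suffix convention and the `normalized_groupdict` helper are hypothetical, and the first alternative uses a literal space instead of `\s` so that the two branches cannot both match the same input.

```python
import re

# Each alternative gets its own distinct group names. The _a/_b suffixes are
# a hypothetical naming convention chosen for this sketch.
PATTERN = re.compile(
    r"(?:(?P<Tok1_a>[a-z]+) (?P<Tok2_a>[a-z]+))"
    r"|(?:(?P<Tok1_b>[a-z]+)\t(?P<Tok2_b>[a-z]+))"
)


def normalized_groupdict(match):
    """Merge suffixed groups into one entry per base name.

    For every base name, keep the value from whichever alternative matched;
    the entry stays None if no alternative captured it.
    """
    result = {}
    for name, value in match.groupdict().items():
        base = name.rsplit("_", 1)[0]  # "Tok1_a" -> "Tok1"
        if value is not None:
            result[base] = value
        else:
            result.setdefault(base, None)
    return result
```

Calling `normalized_groupdict(PATTERN.match("foo\tbar"))` yields a dictionary keyed by `Tok1` and `Tok2` regardless of which branch matched, so calling code does not need its own if/else.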

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-10-31 18:07

Message:
Logged In: YES 
user_id=31435

Since symbolic names are names *of* integer group numbers, 
the regexp compiler should really raise an exception when 
seeing a given symbolic name defined more than once in a 
regexp.
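
For reference, current CPython releases behave the way this comment proposes: compiling a pattern that defines the same group name twice raises `re.error`. The sketch below checks for that. The helper name `compile_error_message` and the shortened pattern are made up for illustration, and the exact wording of the error message is not relied on.

```python
import re


def compile_error_message(pattern_text):
    """Return the compiler's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern_text)
    except re.error as exc:
        return str(exc)
    return None


# Shortened, hypothetical variant of the pattern from the report:
# the name Tok1 is defined in both alternatives.
duplicate = r"(?P<Tok1>[a-z]+)\s|(?P<Tok1>[a-z]+)\t"
message = compile_error_message(duplicate)

# With distinct names the same structure compiles without error.
unique = r"(?P<Tok1>[a-z]+)\s|(?P<Tok2>[a-z]+)\t"
no_message = compile_error_message(unique)
```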

----------------------------------------------------------------------
