[Python-bugs-list] [ python-Bugs-476912 ] regex annoyance
noreply@sourceforge.net
Fri, 02 Nov 2001 06:01:09 -0800
Bugs item #476912, was opened at 2001-10-31 12:17
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=476912&group_id=5470
Category: Regular Expressions
Group: None
>Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Bill Bumgarner (bbum)
Assigned to: Fredrik Lundh (effbot)
Summary: regex annoyance
Initial Comment:
(this may be a feature request-- but it is annoying
enough that I filed it as a bug)
Python's named subexpressions within regular
expressions are an incredibly valuable feature;
combined with the ability to automatically collapse
multiline regexes with comments, they lead to very
readable regexes.
However, there is an annoyance in named
subexpressions that has bitten me several times.
Namely, if you have a situation where a particular
token must be parsed out of the input through the
use of one of two (or more) expressions in a
fashion that cannot be expressed without multiple
possible means of matching any given
subexpression, then the named subexpression
will only be non-None intermittently (depending on
expression order and what was matched).
That is, given:
(?:(?P<Tok1>[a-z]+)\s(?P<Tok2>[a-z]+))|(?:(?P<Tok1>[a-z]+)\t(?P<Tok2>[a-z]+))
In this case, Tok1 and Tok2 will be None if the first
expression matches...
(Yes, this is a contrived example that could be
refactored to not use multiple <Tok1>/<Tok2>
references-- however, more complex expressions
do not always enable easy refactoring.)
----------------------------------------------------------------------
Comment By: Fredrik Lundh (effbot)
Date: 2001-10-31 23:50
Message:
Logged In: YES
user_id=38376
This will be fixed (as in "explicitly disallowed")
in 2.2b2.
(but I guess it's time to start thinking about building
a better framework on top of SRE. after all, the engine
itself can do what Bill wants...)
</F>
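
For readers on a current Python, the "explicitly disallowed" behavior can be observed directly. The following is a minimal sketch, assuming a Python 3 interpreter; the helper name try_compile is hypothetical, and the pattern is the one from the initial comment written with the (?P<name>...) syntax. The exact wording of the error message may differ between versions, so it is printed rather than assumed.

```python
import re

# The duplicated-name pattern from the initial comment, using (?P<name>...).
PATTERN = (
    r"(?:(?P<Tok1>[a-z]+)\s(?P<Tok2>[a-z]+))"
    r"|(?:(?P<Tok1>[a-z]+)\t(?P<Tok2>[a-z]+))"
)


def try_compile(pattern):
    """Return the compile error message, or None if compilation succeeds.

    Hypothetical helper, used only to make the failure easy to inspect.
    """
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


message = try_compile(PATTERN)
print(message)  # the compiler rejects the second definition of Tok1
```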
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-10-31 22:45
Message:
Logged In: YES
user_id=31435
Bill, you misunderstand my comment: I'm not trying to
solve your problem <wink>. Named groups were my idea to
begin with (years ago), and what you want of them is both
unclear and beyond their intended use.
I'm not suggesting to take *away* "support for multiple
subexpressions with the same name": there is no such
support, only the illusion of support due to the regexp
compiler failing to raise an exception when a name is
redefined (that's an old bug, btw: it's persisted across
three generations of underlying regexp engine).
Group names are nothing but synonyms for numbered groups;
they add no power, just convenience. If you want more than
that, that's fine, but then you need to specify exactly
what happens in all cases, and get that implemented. The
semantics of named groups right now are defined in terms of
a trivial bijection with numbered groups, and all you're
seeing when you repeat a name is implementation accidents
due to a failure to enforce that there *is* a bijection.
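
The "names are synonyms for numbers" point can be checked from Python itself. This is a small sketch, assuming Python 3; the pattern and the sample string "width=80" are made up for illustration. The groupindex attribute of a compiled pattern exposes the name-to-number mapping, and looking a group up by name or by number returns the same text.

```python
import re

# Illustrative pattern with two named groups (names chosen for this example).
pattern = re.compile(r"(?P<key>[a-z]+)=(?P<value>[0-9]+)")

# groupindex maps each symbolic name to exactly one group number.
name_to_number = dict(pattern.groupindex)

m = pattern.match("width=80")

# Access by name and access by number reach the same groups.
by_name = (m.group("key"), m.group("value"))
by_number = (m.group(1), m.group(2))

print(name_to_number)
print(by_name == by_number)
```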
----------------------------------------------------------------------
Comment By: Bill Bumgarner (bbum)
Date: 2001-10-31 18:48
Message:
Logged In: YES
user_id=103811
While I agree that the proposed solution of raising an exception would certainly be more acceptable behavior than what is occurring now, doing away with support for multiple subexpressions with the same name would be undesirable.
In particular, named subexpressions let the developer stop counting groups. Reusing a name also spares the developer from writing a few lines of if/else statements to get the value when it might be in either expression A or expression B.
I would rather an error be raised if two separate instances of named expression A were both defined. As long as only one matches, then it shouldn't matter that it appears twice.
The goal is to be able to do this|that where this and that both define the same set of named subexpressions. By definition, only one of this or that will match and, therefore, only one value could be had for a named expression that appears in both this and that.
(As it stands, I have numerous lines of if/else 'this or that' code that generally causes clutter. It means that the groupdict() cannot be treated as a pure result-- I often have to go through the this/that logic to normalize the groupdict into something that actually represents the results I desired).
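
One way to keep the this/that normalization in a single place is sketched below. It is only an illustration, not something proposed in this thread: each alternative gets its own uniquely named groups, and a helper folds them back into one dictionary. The "_a"/"_b" suffix convention and the helper name merged_groupdict are hypothetical. The helper raises if two variants of the same name both matched, which mirrors the semantics requested above.

```python
import re

# Each alternative uses its own group names; the suffix after "_" marks the branch.
pattern = re.compile(
    r"(?:(?P<Tok1_a>[a-z]+) (?P<Tok2_a>[a-z]+))"
    r"|(?:(?P<Tok1_b>[a-z]+)\t(?P<Tok2_b>[a-z]+))"
)


def merged_groupdict(match, sep="_"):
    """Collapse suffixed group names (Tok1_a, Tok1_b) into one key (Tok1).

    Hypothetical helper: raises ValueError if two variants of the same
    base name both matched.
    """
    merged = {}
    for name, value in match.groupdict().items():
        base = name.rsplit(sep, 1)[0]
        if value is None:
            # Record the key so it is present even when no variant matched.
            merged.setdefault(base, None)
        elif merged.get(base) is not None:
            raise ValueError("more than one variant of %r matched" % base)
        else:
            merged[base] = value
    return merged


space_result = merged_groupdict(pattern.match("foo bar"))
tab_result = merged_groupdict(pattern.match("foo\tbar"))
print(space_result)
print(tab_result)
```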
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2001-10-31 18:07
Message:
Logged In: YES
user_id=31435
Since symbolic names are names *of* integer group numbers,
the regexp compiler should really raise an exception when
seeing a given symbolic name defined more than once in a
regexp.
----------------------------------------------------------------------