[Python-Dev] Re: pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching
Mike Coleman
mkc at mathdogs.com
Tue Aug 10 03:38:08 CEST 2004
"Stephen J. Turnbull" <stephen at xemacs.org> writes:
> >>>>> "Mike" == Mike Coleman <mkc at mathdogs.com> writes:
> Mike> m0 = re.match(r'([A-Z]+|[a-z]+)*', 'XxxxYzz')
> Sure, but regexp syntax is a horrible way to express that.
Do you mean, horrible compared to spelling it out using a Python loop that
walks through the array, or horrible compared to some more serious parsing
package?
For the former, I would disagree. I see hand-written loops like that a lot and
they drive me crazy. They remind me of the bad old days of building 'while'
loops out of 'if's and 'goto's.
For the latter, I think it depends on the complexity of the matching, and the
level of effort required to learn and distribute the "not-included" parsing
package. I certainly wouldn't want to see someone try to write a language
front-end with this, but for a lot of text-scraping activities, I think it
would be very useful.
> This feature would be an attractive nuisance, IMHO.
I agree that, like list comprehensions (for example), it needs to be applied
with good judgement.
Turning it around, though, why *shouldn't* there be a good mechanism for
returning the multiple matches for multiply matching groups? Why should this
be an exception? If you agree that there should be a mechanism, it certainly
doesn't have to be the one in the PEP, but what would be better? I'd welcome
alternative ideas here.
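
To make the limitation concrete, here is a small sketch of what the current
're' module returns for the first pattern quoted above. The variable names are
only illustrative, and the 'findall' rescan is a workaround that happens to fit
this simple pattern, not the mechanism proposed in the PEP.

```python
import re

text = 'XxxxYzz'
m0 = re.match(r'([A-Z]+|[a-z]+)*', text)

# The whole string matches, but group 1 keeps only its final repetition.
whole = m0.group(0)   # 'XxxxYzz'
last = m0.group(1)    # 'zz'

# Workaround for this simple case: rescan with the group's body alone.
pieces = re.findall(r'[A-Z]+|[a-z]+', text)   # ['X', 'xxx', 'Y', 'zz']
```

The rescan only works here because the group's body can be applied on its own;
with nested groups the second pass gets more involved.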
> Mike> p = r'((?:(?:^|:)([^:\n]*))*\n)*\Z'
>
> This is an _easy_ one, but even it absolutely requires being written
> with (?xm) and lots of comments, don't you think?
I think it's preferable--that's why I did it. :-)
> If you're going to be writing a multiline, verbose regular expression, why
> not write a grammar instead, which (assuming a modicum of library support)
> will be shorter and self-documenting?
If there were a suitable parsing package in the standard library, I agree that
this would probably be a lot less useful.
As things stand right now, though, it's a serious irritation that we have a
standard mechanism that *almost* does this, but quits at the last moment. If
I may wax anthropomorphic, the 're.match' function says to me as a programmer:
*You* know what structure this RE represents, and *I* know what
structure it represents, too, because I had to figure it out to
do the match. But too bad, sucker, I'm not going to tell you what
I found!
Irritating as hell.
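
As a rough sketch of what that means in practice, here is the colon-separated
pattern quoted earlier applied to some sample text. The input string is made
up for illustration, and the 're.M' flag is assumed so that '^' also matches
after each newline. The match confirms that the text has the expected shape,
but the lines and fields still have to be recovered with a second pass.

```python
import re

p = r'((?:(?:^|:)([^:\n]*))*\n)*\Z'
text = 'root:x:0\nmike:y:1\n'   # hypothetical sample input

m = re.match(p, text, re.M)     # re.M so '^' also matches after each newline
matched_all = m is not None and m.group(0) == text

# The match object exposes only the last repetition of each group, so the
# lines and fields are rebuilt by scanning the text a second time.
rows = [line.split(':') for line in text.splitlines()]
```

Here 'rows' comes out as a list of lines, each a list of fields, which is
exactly the structure the matcher had already worked out.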
Mike