[Python-Dev] Re: pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching

Thu Aug 12 12:44:26 CEST 2004

Mike Coleman wrote:
> "Stephen J. Turnbull" <stephen at xemacs.org> writes:

> Re maintenance, yeah regexp is pretty terse and ugly.  Generally, though, I'd
> rather deal with a reasonably well-considered 80 char regexp than 100 lines of
> code that does the same thing.

ditto

>>It's not obvious to me how to make grammar rules pretty in Python, but
>>implementing an SLR parser-generator should be at most a couple of
>>days' work to get something you could live with.
> 
> 
> There are several of these packages available and I'm all in favor of their
> use, particularly for problems of sufficient complexity.  These are currently
> hindered, though, by the fact that none have been elected for inclusion in the
> standard library.
> 
> Furthermore, since we have 're', and it's not going away, I'd really like to
> fix this repeated match deficiency, one way or another.

Well, I guess that if you want structmatch in the stdlib you'll have 
to show that it's better than its alternatives, including those parser 
packages.
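
For reference, the "repeated match deficiency" can be reproduced with 
plain 're'.  This is only a minimal sketch; the sample text and the 
variable names are made up for illustration.  A group inside a repeat 
keeps only its last iteration, so the usual workaround is to validate 
the whole string first and then scan it a second time:

```python
import re

text = "alpha beta gamma"  # made-up sample input

# A capturing group inside a repeat keeps only its final iteration.
m = re.match(r"(?:(\w+)\s*)+$", text)
last_only = m.groups()

# Common two-step workaround: validate the overall shape with one
# regex, then re-scan the matched text to collect every item.
words = re.findall(r"\w+", m.group(0))

print(last_only)  # only the last word survives in the group
print(words)      # all three words, from the second pass
```

The second pass is exactly the kind of extra code a structured match 
would make unnecessary, so the comparison with the alternatives should 
probably start from this baseline.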

>>BTW, do you have a sample implementation for re.structmatch?  Fredrik
>>seems to have some doubt about ease of implementation.
> 
> 
> Erik Heneryd posted an intriguing Python prototype, which I'm still looking
> at.

You'd still have to do a real implementation.  If it can't be done 
without rewriting a whole lot of code, that would be a problem.

>>My objection is that it throws away a lot of structure, and therefore
>>is liable to return the _wrong_ parse, or simply an error with no hint
>>as to where the data is malformed.
> 
> 
> Hmm.  Regarding the lack of error diagnosis, I'm not too concerned about this,
> for the reason I mention above.  When 're.structmatch' does fail, though, it
> returns a "farthest match position", which will usually be of some aid, I
> would think.
> 
> Regarding getting the parse wrong, sure, this could happen.  Users will have
> to be judicious.  The answer here is to write RE's with some care, realize
> that some matches may require further checking after being returned by
> re.structmatch, and further realize that some parsing problems are too complex
> for this method and require a grammar-based approach instead.  This doesn't
> really seem worse than the situation for re.match, though, to me.

Hmm... I think this is the wrong approach.  Your PEP is not just about 
"structured matching"; it tries to deal with a couple of issues, and I 
think it would be better to address them separately, one by one:

* Parsing/scanning - this is mostly what's been discussed so far...

* Capturing repeated groups - IMO nice-to-have (tm) but not something I 
would lose sleep over.  Hard to do.

* Partial matches - would be great for debugging more complex regexes. 
Why not a general re.RAISE flag raising an exception on failure?

* ... ?
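
To make the re.RAISE idea a bit more concrete: no such flag exists in 
're' today, so the following is only a rough sketch of the behaviour 
as a wrapper function.  The names MatchError and match_or_raise are 
hypothetical, and the sketch does not try to report a farthest match 
position, since 're' does not expose one:

```python
import re


class MatchError(ValueError):
    """Hypothetical exception type; not part of the re module."""

    def __init__(self, pattern, string):
        super().__init__("pattern %r did not match %r" % (pattern, string))
        self.pattern = pattern
        self.string = string


def match_or_raise(pattern, string, flags=0):
    """Behave like re.match, but raise MatchError instead of returning None."""
    m = re.match(pattern, string, flags)
    if m is None:
        raise MatchError(pattern, string)
    return m


# Usage: a successful match is returned as usual...
ok = match_or_raise(r"\d+", "42abc")

# ...while a failure surfaces as an exception rather than a silent None.
try:
    match_or_raise(r"\d+", "abc")
    failed = False
except MatchError as exc:
    failed = True
    failed_pattern = exc.pattern
```

A real flag would presumably also carry the position where matching 
gave up, which is the part that needs support from the engine itself.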

Erik