[Python-Dev] Re: pre-PEP [corrected]: Complete, Structured Regular Expression Group Matching

Stephen J. Turnbull stephen at xemacs.org
Wed Aug 11 10:51:29 CEST 2004


>>>>> "Mike" == Mike Coleman <mkc at mathdogs.com> writes:

    Mike> "Stephen J. Turnbull" <stephen at xemacs.org> writes:

    >> >>>>> "Mike" == Mike Coleman <mkc at mathdogs.com> writes:

    Mike> m0 = re.match(r'([A-Z]+|[a-z]+)*', 'XxxxYzz')

    >> Sure, but regexp syntax is a horrible way to express that.

    Mike> Do you mean, horrible compared to spelling it out using a
    Mike> Python loop that walks through the array,

I didn't have that in mind, but I can live with it.  Regexp doesn't
compare well even with that when you consider error-checking and
program maintenance.

    Mike> or horrible compared to some more serious parsing package?

That's more like it.  It doesn't have to be all that serious, either.

It's not obvious to me how to make grammar rules pretty in Python, but
implementing an SLR parser-generator should be at most a couple of
days' work to get something you could live with.

Note that one thing about the LR parser itself is that the stack
machine is quite generic; you can start with SLR, and go on to add
more powerful parser table generators later.

    Mike> I certainly wouldn't want to see someone try to write a
    Mike> language front-end with this, but for a lot of text-scraping
    Mike> activities, I think it would be very useful.

I agree it would be useful.  I think that a parser generator would be
equally useful for this, as easy to learn as regexps are (assuming
somebody comes up with a tasteful way to express grammar rules), and
more powerful with even a basic implementation.

BTW, do you have a sample implementation for re.structmatch?  Fredrik
seems to have some doubt about ease of implementation.

    >> This feature would be an attractive nuisance, IMHO.

    Mike> I agree that, like list comprehensions (for example), it
    Mike> needs to be applied with good judgement.

I don't see the analogy to list comprehensions.  My objection is not
that re.structmatch can express a request for a huge amount of data in
compact form.  If you've got the RAM, go right ahead.

My objection is that it throws away a lot of structure, and therefore
is liable to return the _wrong_ parse, or simply an error with no hint
as to where the data is malformed.  (AFAICS this is the practical
implication of Fredrik's "theoretical" statement "the RE engine is
designed to tell you if a string you have is a member of this set".
Yes, the Python re engine can provide some annotations for a well-formed
string, but in the end, well-formedness of the string is yes/no, with
no error diagnosis.)

Sure, there are applications like reading a matrix where the structure
is as uniform as the "*" operator implies.  But what do you get from
that?  A list of lists of _strings_ ... and you need a nested loop to
turn those strings into numbers that your linear algebra package can
use.  No?  I just don't see the benefits in code reduction you claim.

    Mike> Turning it around, though, why *shouldn't* there be a good
    Mike> mechanism for returning the multiple matches for multiply
    Mike> matching groups?

Because they're all too often not "multiple matches".  They're matches
for different objects that happen to match the same regular
expression, and the notation for the parser should express that
without requiring comments.  re.structmatch _encourages_ a style where
users deliberately throw away structure; your /etc/group parser is an
example.  (See below.)

Furthermore, since this is going to be recursive, users are going to
end up with "list of list of ..." nested to arbitrary depth, depending
on how complex the regexp is.  Isn't that what we have Lisp for? <wink>

    Mike> As things stand right now, though, it's a serious irritation
    Mike> that we have a standard mechanism that *almost* does this,
    Mike> but quits at the last moment.  If I may wax anthropomorphic,
    Mike> the 're.match' function says to me as a programmer

    Mike>     *You* know what structure this RE represents, and *I*
    Mike> know what structure it represents, too, because I had to
    Mike> figure it out to do the match.  But too bad, sucker, I'm not
    Mike> going to tell you what I found!

    Mike> Irritating as hell.

Problem is, you (for some values of "you") and the program both _know_
what structure it represents, but you have different structures in
mind.  This is a standard source of regexp bugs.  The structmatch
extension will exacerbate the problem.

I suppose you would be satisfied if you could represent

file := file line
line := group ':' passwd ':' number ':' any '\n'
group := '[A-Za-z0-9]+'
passwd := '[A-Za-z0-9]+'
number := '[0-9]+'
any := '.*'

by

'(([A-Za-z0-9]+:)*.*\n)*'

in which case I agree it's pretty harmless.  But note well, you've
already allowed arbitrary words instead of numbers for 'number', and
improving 'passwd' to check for the characteristic signature of a
crypted password or a "no password authentication" flag would be
annoying with re.structmatch, while it's trivial with a parser.

You say, "that's a matter of application of good judgement".  My
experience is that when it comes to regexps, application of good
judgement is pretty rare.  I have a lot of experience with parsing
done by regexp, none of it pleasant.  Have some pity on the c.l.py
regulars who will be debugging the monstrosities that some people will
come up with! <0.5 wink>

I grant that a full-blown parser generator is overkill for your target
set of applications.  But I stick with my judgement that regexps as
they are are an attractive nuisance, and re.structmatch would make the
situation worse in proportion to its usage.  I hope your proposal will
be superseded either by a scanner package, or even by a parser package.


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.


More information about the Python-Dev mailing list