[Python-Dev] pre-PEP: Complete, Structured Regular Expression Group Matching

Fri Aug 6 04:46:51 CEST 2004

If you've ever been frustrated that there wasn't an easy way to parse a string
or file, even though you could easily match it with re.match, then this PEP
may be for you.  If what's being proposed seems a little ghastly, well, I'll
just say that I've been wanting something like this for a long time and this
is the best I could come up with.  Your comments invited.

Mike

##############################################################################
PEP: XXX
Title: Complete, Structured Regular Expression Group Matching
Version: $Revision: 1.3 $
Last-Modified: $Date: 2004/08/02 05:18:31 $
Author: Mike Coleman <mkc at mathdogs.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-Aug-2004
Python-Version: 2.4
Post-History: 1-Aug-2004

Abstract
========

This proposal is to add a new method ``re.structmatch`` that fully
captures matches for groups that match repeatedly.  The method returns
a structured match tree whose shape is determined by the pattern
groups.  Ultimately this will allow a string that can be described by
a Python regular expressions (e.g., the contents of ``/etc/passwd`` or
``.ini`` files, or the output of ``diff``) to be parsed into the
obvious parse tree with a single call to ``re.structmatch``.

Motivation
==========

A notable limitation of the ``re.match`` method is that it fails to
capture all group match information for repeatedly matched groups.
For example, in a call like this ::

    m0 = re.match(r'([A-Z]|[a-z])*', 'XxxxYzz')

one would like to see that the group which matched four times matched
the strings ``'X'``, ``'xxx'``, ``'Y'`` and ``'zz'``.  In the current
implementation, only the last match for each group (``'zz'`` in this
case) is available, even though the other matches are calculated
implicitly during the matching process.  For simple cases, the missing
strings might be easily recovered by other means, but for more
complicated cases this is a significant annoyance.

The simplest extension would be to collect all of the matches for each
group in a list, so that in the call above for example ::

    m0.group(1) --> ['X', 'xxx', 'Y', 'zz']

(A mechanism like this is apparently to be added to Perl in version 6
[#PERL6]_, though that did not inspire this PEP.)

This immediately suggests the question of what is to be done for
nested repeats, like so ::

    m1 = re.match(r'("([A-Z]|[a-z])*"\s*)*', '"Xx" "yy" "ZzZ"')

We could have ::

    m1.group(2) --> ['X', 'x', 'yy', 'Z', 'z', 'Z' ]

but this flattened result would discard useful information about the
structure of the match.  A more useful result would be ::

    m1.group(2) --> [['X', 'x'], ['yy'], ['ZzZ']]

This is definitely an improvement.  Consider the following slightly
more complicated case, though ::

    m1 = re.match(r'("([A-Z]|[a-z])*"(\s*))*', '"Xx" "yy" "ZzZ"')

which would give ::

    m1.group(2) --> [['X', 'x'], ['yy'], ['Z', 'z', 'Z']]
    m1.group(3) --> [' ', ' ', '']

This is less useful than the putative conceptual structure of the
match, which would be something like ::

    [[['X', 'x'], ' '],
     [['yy'], ' '], 
     [['Z', 'z', 'Z'], '']]

In this simple case, this structure could be recovered using the
``zip`` function, but in the general case this is a non-trivial
transformation.

This PEP proposes a mechanism to address the general case directly.

Proposal
========

The new function ``re.structmatch`` has the same arguments as
``re.match`` and will match using exactly the same algorithm.  Instead
of returning a MatchObject upon success, ``re.structmatch`` will
return a "tree" (or rather, a forest) in the form of a Python list.

First we will define some terms.  Groups in the input pattern are
parenthesized subexpressions that "capture" strings when matched, and
are of the form ``'(abc)'`` or ``'(?P<name>abc)'``.  Leaf groups are
groups that do not contain any subgroups.  So, for example, the group
``'(abc)'`` is a leaf group and the group ``'(a(xyz)b)'`` is not a
leaf group (but it does contain the leaf group ``'(xyz)'``).  We shall
call parenthesized expressions that are not groups "nongroups"; these
include constructs like ``'(?:abc)'`` as well as lookahead and
lookbehind assertions.

The structure of the returned tree is determined from the grouping
tree present in the pattern in the following manner:

* Leaf groups not followed immediately by a repeat operator map to a
  single string::

    re.structmatch(r'(...)', 'abcdef') --> ['abc']
    re.structmatch(r'(...).(..)', 'abcdef') --> ['abc', 'ef']

* Leaf groups followed immediately by a repeat operator map to a list
  of strings::

    re.structmatch(r'([^d])*', 'abcdef') --> [['a', 'b', 'c']]
    re.structmatch(r'([^d])*(.)*', 'abcdef')
        --> [['a', 'b', 'c'], ['d', 'e', 'f']]
    re.structmatch(r'(..)*', 'abcdef') --> [['ab', 'cd', 'ef']]
    re.structmatch(r'(.){2}', 'abcdef') --> [['a', 'b']]

* Non-leaf groups not followed immediately by a repeat operator map to
  a list of the mappings of their subgroups::

    re.structmatch(r'(...)', 'abcdef') --> ['abc']
    re.structmatch(r'((...))', 'abcdef') --> [['abc']]
    re.structmatch(r'(((...)))', 'abcdef') --> [[['abc']]]
    re.structmatch(r'((...)())', 'abcdef') --> [['abc'], []]
    re.structmatch(r'(.(.)(.(.)).(.))', 'abcdef')
       --> ['a', ['b'], ['c', ['d']], 'e', ['f']]

* Non-leaf groups followed immediately by a repeat operator map to a
  list of the mappings of their repeated matches::

    re.structmatch(r'((.).(.))*', 'abcdef') --> [[['a', 'c'], ['d', 'f']]]
    re.structmatch(r'(([ab])*(x)*)*', 'baxbxx')
       --> [[['b', 'a'], ['x']], [['b'], ['x', 'x']]]

* In the case of alternation, only the matched groups appear in the
  output::

    re.structmatch(r'([^a])*|([^d])*', 'abcdef') --> [['a', 'b', 'c']]

  If it's important to know which alternative matched, named groups
  can be used.

* Named groups map to a pair where the first member is the name and
  the second member is what the unnamed group would have mapped to::

    re.structmatch(r'([^d])*(?P<rest>.)*', 'abcdef')
        --> [['a', 'b', 'c'], ('rest', ['d', 'e', 'f'])]

* Nongroups do not affect the output structure.  Compared to non-leaf
  groups, nongroups have the effect of "flattening" the output, like
  so::

    re.structmatch(r'((.).(.))', 'abcdef') --> [['a', 'c']]
    re.structmatch(r'(.).(.)', 'abcdef') --> ['a', 'c']
    re.structmatch(r'(?:(.).(.))', 'abcdef') --> ['a', 'c']

    re.structmatch(r'((.).(.))*', 'abcdef') 
        --> [[['a', 'c'], ['d', 'f']]]
    re.structmatch(r'(?:(.).(.))*', 'abcdef') 
        --> [['a', 'c', 'd', 'f']]

  (Nongroups that do not contain leaf groups have no effect whatsoever
  on the output structure.)

Additional Features
-------------------

In many cases in which ``'re.structmatch'`` fails to match, the cause will
be due to an unexpected error in the format of the string being
matched.  In order to assist the calling program in generating a more
meaningful possible error message, ``'re.structmatch'`` will return the
endpoint of the largest match against the searched string.  So, for
example ::

        re.structmatch('abcd', 'abxxx') --> 2
        re.structmatch('abcde|z', 'abxxx') --> 2
        re.structmatch('x*?y', 'xxx') --> 3

(This return value should be considered advisory rather than exact, as
future improvements in the match algorithm may make it difficult to
calculate the exact value.)

Examples
========

Parsing ``/etc/group``
----------------------

If ``/etc/group`` contains the following lines ::

    root:x:0:
    daemon:x:1:
    bin:x:2:
    sys:x:3:

then it can be parsed as follows ::

    p = r'''(?xm)           # VERBOSE, MULTILINE
            (
              (
                # one field, preceded by a ':' if
                # the field is not the line's first
                (?:^|:) ([^:\n]*)
              )*
              \n
            )*
         '''
    s = open('/etc/group').read()
    tree = re.structmatch(p, s)

which will give ``tree`` the following value::

    [['root', 'x', '0', ''],
     ['daemon', 'x', '1', ''],
     ['bin', 'x', '2', ''],
     ['sys', 'x', '3', '']]

Note that the above pattern is written to allow any ``/etc/group``
field to be empty.  The pattern won't work correctly for versions of
Python prior to 2.4 because of the possibility of repeating empty
matches.  This problem seems to have been fixed in Python 2.4.  (A
slightly more complicated pattern would work for pre-2.4 versions.)

Parsing ``.ini`` files
----------------------

The Microsoft ``.ini`` file format is found in various contexts (there
is one in the Python source distribution:
``Tools/freeze/extensions_win32.ini``).  Given a file with contents ::

    [singers]
    good=Annie Lennox
    bad=William Shatner

    [comedians]
    good=Monty Python

the file can be parsed with pattern ::

    r'''(?xm)           # VERBOSE, MULTILINE
        \s*             # leading whitespace
        (               # begin sections
          ^\[ ([^]]+) \] \s*  # section header
          (             # begin params
            ^([^=]+) =  # param name
            (.*) $      # param value
            \s*         # intervening whitespace
          )*            # end params
        )*              # end sections
        \Z              # assert that the entire
                        #   input was matched
     '''

to give ::

    [['singers',
      ['good', 'Annie Lennox'],
      ['bad', 'William Shatner']],
     ['comedians',
      ['good', 'Monty Python']]]

The pattern given omits support for ``.ini`` file comments for the
sake of simplicity, but this could be added.

Details
=======

The proposal states that ``re.structmatch`` will match using exactly
the same algorithm as ``re.match``.  This might be a little odd for a
pattern like ``r'(x|y|z)*aaa\1'``, where the algorithm will require
that the ``\1`` back-reference match (at most) one character.  It's
not obvious whether there is any better option, though, and there are
advantages of implementation and simplicity for keeping the match
algorithms of these two methods identical.  (A similar argument
applies to ``'(?P=NAME)'``.)

Discussion
==========

Part of the reason the proposed method would be so useful is that
``re.split`` currently does not split on empty matches.  If it had
this feature, one could split on lookahead and lookbehind assertions,
which would significantly ease parsing of strings with a recursive
regular structure (e.g., ``.ini`` files).  Patch `#988761`_ will
correct this ``re.split`` deficiency, if it is accepted.

.. _#988761: https://sourceforge.net/tracker/?func=detail&aid=988761&group_id=5470&atid=305470

For particularly complex patterns, the current 99 group limit might
actually become a practical problem.

One could imagine a variation in which subtrees of named groups might
be captured in dictionaries rather than lists, with the group names
used as keys.

Rejected Alternatives
=====================

Several simpler alternatives are rejected in the `Motivation`_ section
above.  Although these alternatives would be better than nothing, they
would not adequately satisfy the goal of this proposal, which is to
allow the programmer to extract the structure of a string in an
immediate manner.

It would be possible to use tuples for some substructures in the
return tree, rather than composing it strictly of lists.  This
alternative was rejected because it was thought useful to be able to
modify the resulting tree in place, perhaps to add decorations, etc.,
and tuples would make this more difficult.

References
==========

.. [#PERL6] Multiple regex matches in Perl Apocalypse 5
   (http://www.perl.com/pub/a/2002/06/04/apo5.html?page=22#rfc%20360:%20allow%20multiply%20matched%20groups%20in%20regexes%20to%20return%20a%20listref%20of%20all%20matches)

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: