Finding nested expressions in regexs

Tim Peters tim_one at email.msn.com
Fri May 14 20:52:06 EDT 1999


[Stuart I Reynolds, with creative addressing <wink>
> To: python-list at python.org
> Cc: More at python.org; help at python.org; with at python.org;
> simple at python.org; regular at python.org; expression at python.org;
> grouping at python.org; with at python.org; re at python.org

> I need to extract nested expressions contained within paired symbols e.g
> "{ .. }" or "( .. )"
>
>     input                  => matched
> 123{abc}4567               => {abc}
> 123{abc}456}7              => {abc}
> hij{123{abc}4568}def       => {123{abc}4568}
> hij{123{a\{bc}45\}68}def   => {123{a\{bc}45\}68}
>
> Note:
> -  "\{" and "\}" should be completely ignored
> -  Expressions may be nested as above but also repeating:
>
> ab{cd{ef}gh}ijkl{mn{o}}qrstu  => {cd{ef}gh}, {mn{o}}
>
> Is there an easy way to do this with regular expressions?

Sorry, not even a hard way.  "Regular expressions can't count" -- nested
brackets are beyond their abilities.  Easier to see if you simplify it *way*
down, and just try to come up with a regexp that matches these (& only
these) strings:

    ()
    (())
    ((()))
    (((())))
    ...
    "(" * n + ")" * n

> The last point makes it difficult to use a straight-forward
> non-greedy match which would find:
>   {cd{ef}
> in the last example.

Actually, it was impossible long before the last point.

I'll attach a program that solves the problem without regexps.  As is often
the case, it's much wordier than a solution using hairy regexps.  OTOH, it
gives much better error msgs, and also works <wink>.

Note that for your second example it raises an error:

ValueError:
123{abc}456}7
           ^
Error in findtop: no matching open bracket for }

If you want to try to make some sort of sense out of cases like that, think
twice <0.9 wink>.

no-regexp-no-problem-ly y'rs  - tim

def findtoperror(msg, s, i):
    emsg = "\n" + s
    if s[-1:] != "\n":
        emsg = emsg + "\n"
    emsg = emsg + " " * i + "^\n"
    emsg = emsg + "Error in findtop: " + msg + "\n"
    raise ValueError(emsg)

# Return list of (start, stop) indices for top-level bracketed expressions.

def findtop(s, openers="([{", closers=")]}"):
    import string
    if len(openers) != len(closers):
        raise ValueError("openers and closers must be same length")
    startstack, answer = [], []
    i, n = 0, len(s)
    while i < n:
        ch = s[i]
        if ch == "\\":
            i = i + 1
            if i >= n:
                findtoperror("lone backslash at end", s, i-1)

        elif ch in openers:
            startstack.append(i)

        elif ch in closers:
            if len(startstack) == 0:
                findtoperror("no matching open bracket for " + ch, s, i)
            starti = startstack.pop()
            opener = s[starti]
            expected_closer = closers[string.find(openers, opener)]
            if expected_closer != ch:
                findtoperror("expected " + expected_closer, s, i)
            if len(startstack) == 0:
                answer.append( (starti, i+1) )

        i = i + 1

    if startstack:
        msg = ["unclosed"]
        for i in startstack:
            msg.append("%s at %d," % (s[i], i))
        findtoperror(string.join(msg)[:-1], s, n-1)

    return answer

# like findtop, except return list of substrings instead of index pairs

def findtopstring(s, openers="([{", closers=")]}"):
    answer = []
    for i, j in findtop(s, openers, closers):
        answer.append(s[i:j])
    return answer






More information about the Python-list mailing list