[Tutor] mxTextTools help: find the deepest in nest lists

Wed Dec 10 17:41:40 EST 2003

On Wed, 10 Dec 2003 pan at uchicago.edu wrote:

> I did try the re approach earlier, but I got an "maximum recursion"
> error when the list gets complicated. I guess it is not good for large
> nested list.

Hi Pan,

Ah!  That problem is due to the nongreediness (?) in the regex.  But
Python 2.3's regex engine is supposed to have added a few special cases
for nongreedy matches to avoid the recursion problem.  Do you still get
the maximum recursion error with 2.3?

Anyway, it turns out that we can rewrite the regex to make it work without
the nongreediness; it should perform a lot better for you.  Try:

###
def find_innermost_nodes(tree_string):
    regex = re.compile(r"""\[          ## literal open brace
                           [^\[\]]+    ## one or more non-brace
                                       ## characters
                           \]          ## literal close brace
                      """, re.VERBOSE)
    return regex.findall(tree_string)
###

It works correctly for those cases that you've given us:

###
>>> find_innermost_nodes("[[[1,[2, 2a]],[[3,3b],4]],6]")
['[2, 2a]', '[3,3b]']
>>> find_innermost_nodes("[[a.1,b_2],[[c,[d.3,e]],f]]")
['[a.1,b_2]', '[d.3,e]']
>>> find_innermost_nodes("[[[[[a1,a2],out],[d.1, d.2]],o2],[[o3,o4],o5]]")
['[a1,a2]', '[d.1, d.2]', '[o3,o4]']
###

and avoids the problems with recursion --- it should work a lot better on
complex binary trees strings.

What's the data format here that you're working with?  The example that
you've given us looks similar to the "newick" tree format,

    http://evolution.genetics.washington.edu/phylip/newicktree.html

which is used in applications like phylogenetic trees.  The traditional
approach for processing these structures is to use a module to do the
parsing for us, maybe something like:

    http://www.daimi.au.dk/~mailund/newick.html

Most people do not use regular expressions to directly process structured
data, but use an auxillary parser to let them take advantage of the data
structure.

Just as we try to discourage folks from doing HTML parsing with regular
expressions alone, we're trying to discourage you from doing tree
processing with regular expressions alone.  *grin* Tell us more about the
format, and we can look for a parser for it.

> The tagtable concept is quite different from re. Basically it is a
> collection of 'states'.

Regular expressions are, in fact, state machines.  There's a direct
equivalence between the two.  If you want to learn more, there's a very
nice introduction in the book "Introduction to the Theory of Computation":

    http://www-math.mit.edu/~sipser/book.html

As a concrete example, the regular expression:

    a+b

can be represented as a finite state machine with three states and three
arrows:

                     -\
                    / | a
                    V /
    +-----+   a    +--+    b    +------+
    |START|------->|  |-------->|FINISH|
    +-----+        +--+         +------+

The second arrow loops the second state to itself, and is responsible for
the repetition of the '*' regex.

> When parsing, the core engine reads through the target text, check if
> the read text match a state. Defined in each state is a decision tree,
> which directs which state will become the next state when matched and
> which when failed.

Hmmm... Ok, I will definitely have to read:

    http://www.lemburg.com/files/python/mxTextTools.html

closely; I'll take a look later tonight and see why your tag table isn't
working the way you expect.

I'm sure mxTextTools can handle this problem, but it does seem like a very
low level approach.  mxTextTools allows us to manually construct the state
machine table, but 're' does this for us on a larger scale, handling the
states and table details for us.

Talk to you later!