Parsing of nested tags

Mon Mar 6 19:45:00 EST 2000

At 00:03 3/4/00 +0100, Stefan Schwarzer wrote:
>Hello :-)
>
>Some time ago I wrote one of the zillion programs that read some
>kind of format file(s) and make HTML from them. The program is able to
>convert <<I italic text>> to <I>italic text</I> or <<LINK link;text>> to
><A HREF="link">text</A>.
>
>However, currently I can't convert <<LINK link;<<I italic text>>>> to
><A HREF="link"><I>italic text</I></A>. The relevant code is
>
>-----8<---------------------------------------------------------------
>
>######################################################################
># perform substitutions
>#   <<link url;url_text>>, url_text defaults to url
>link_pattern = re.compile( '(?si)<<link (.+?)(?:;(.*?))?>>' )
>
>def make_link( matchobj ):
>
>    url, url_text = matchobj.groups()
>    if not url_text:                    # use url as url_text by default
>        url_text = url
>    url, url_text = map( string.strip, [ url, url_text ] )
>
>    return string.join( (
>      html_format.link_format[ 0 ],
>      url,
>      html_format.link_format[ 1 ],
>      url_text,
>      html_format.link_format[ 2 ] ), '' )
>
># evaluate some formatting in the text to legal code
>def make_html( text ):
>    # order matters - conversion to links has to come first
>    text = re.sub( link_pattern, make_link, text )
>    text = re.sub( r'<<(\S+)\s(.*?)>>', r'<\1>\2</\1>', text )
>    text = re.sub( r'(?i)<PROG>(.*?)</PROG>', r'<EM>\1</EM>', text )
>    text = re.sub( r'(?i)<FILE>(.*?)</FILE>', r'<EM>\1</EM>', text )
>    text = re.sub( r'(?i)<OPT>(.*?)</OPT>', r'<STRONG>\1</STRONG>', text )
>    return text
>
>-----8<---------------------------------------------------------------
>
>Now the question: What is the best way to enable recursive parsing,
>as in the example above?
>
>So far I have thought of two ways. One may be to extend the regular
>expression(s), but this is already cumbersome to read. The other
>possibility would be to scan the string and replace <<...>> occurrences
>which don't contain <<, perhaps multiple times, until all patterns are
>substituted.

A while back I read that a regular expression on its own can't match
arbitrarily deep nesting, so I came up with this (re-less) way:

-find where every opening and closing delimiter is in the text:

# A func like this is in TextTools, but I couldn't find it in the
# standard library...

def findall(text, substring):
    """findall(text, substring) -> slices of text where substring occurs.
    """
    end, slices, size = 0, [], len(substring)
    # every piece but the last is followed by one occurrence of substring
    for piece in text.split(substring)[:-1]:
        start = end + len(piece)
        end = start + size
        slices.append((start, end))
    return slices

openslices, closeslices = findall(text, '<<'), findall(text, '>>')
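
The same slices can also be collected by walking the text with str.find
instead of splitting it. This is only a minimal sketch: the name
find_slices is made up for this example, and like findall it reports
non-overlapping occurrences only.

```python
def find_slices(text, substring):
    """Return (start, end) slices for each non-overlapping occurrence."""
    if not substring:
        # An empty substring would match everywhere and never advance.
        raise ValueError("substring must not be empty")
    slices = []
    size = len(substring)
    start = text.find(substring)
    while start != -1:
        end = start + size
        slices.append((start, end))
        # Resume the search after this match so occurrences never overlap.
        start = text.find(substring, end)
    return slices


sample = '<<LINK link;<<I italic text>>>>'
openslices = find_slices(sample, '<<')
closeslices = find_slices(sample, '>>')
```

For the sample string, openslices should come out as (0, 2) and (12, 14),
and closeslices as (27, 29) and (29, 31).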

-merge these two lists, and make a dictionary so you can tell if an
 index means open a new tag or close an old one

def merge_n_sort(*klvtupes):
    """merge_n_sort(*klvtupes) -> sortedlist, dict

    Take in a set of 2-ples made of a list of slices and 
    a delimiter and return a sorted list of all left 
    indices and a {left: delimiter} dictionary."""
    indexdict = {}
    for keylist, delimiter in klvtupes:
        for left, right in keylist:
            indexdict[left] = delimiter
    indices = list(indexdict.keys())  # a real list, so it can be sorted in place
    indices.sort()
    return indices, indexdict

indices, indexdict = merge_n_sort((openslices, '<<'),
                                  (closeslices, '>>'))

-Then you just do something like:

stack, out = (), []
for i in range(len(indices)):

    if indexdict[indices[i]] == '<<':
        ...handle opening the tag and
        stack = tagname, stack  #push...

    else:  #indexdict[indices[i]] == '>>':
        try:
            tagname, stack = stack  #...pop
            ...handle closing the tag
        except ValueError:    # e.g. (mismatched)parens)
            pass
return ''.join(out)  # no separator, so no stray spaces end up in the output

(NB: you have to rewrite findall if your text can contain escaped
     delimiters; this is not-very-tested code.)
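
To make the push/pop idea concrete, here is one way the whole thing might
look end to end. Treat it as a sketch under several assumptions, not as
the finished answer: render and convert are names invented for this
example, the HTML strings are hard-coded rather than taken from
html_format, and a tag name is taken to be the first whitespace-separated
word inside the delimiters. The delimiters are located with re.finditer,
which is a swap for findall and merge_n_sort; the re only finds '<<' and
'>>', while the nesting is still handled by the stack. Unmatched
delimiters are left in the output literally, which is my choice and not
something the code above specifies.

```python
import re


def render(body):
    """Turn the text inside one <<...>> pair into HTML (hypothetical helper)."""
    parts = body.split(None, 1)
    if not parts:
        return ''
    name = parts[0]
    rest = parts[1] if len(parts) > 1 else ''
    if name.upper() == 'LINK':
        # <<LINK url;text>>, where text defaults to the url
        url, _sep, label = rest.partition(';')
        url = url.strip()
        label = label.strip() or url
        return '<A HREF="%s">%s</A>' % (url, label)
    return '<%s>%s</%s>' % (name, rest, name)


def convert(text):
    """Convert nested <<TAG ...>> markup to HTML using a stack of buffers."""
    stack = [[]]  # stack[0] collects top-level output
    pos = 0
    for match in re.finditer(r'<<|>>', text):
        # Plain text since the previous delimiter goes to the current buffer.
        stack[-1].append(text[pos:match.start()])
        pos = match.end()
        if match.group() == '<<':
            stack.append([])  # push: start collecting a new tag body
        elif len(stack) > 1:
            body = ''.join(stack.pop())  # pop: the innermost tag is complete
            stack[-1].append(render(body))
        else:
            stack[-1].append('>>')  # unmatched close, keep it literally
    stack[-1].append(text[pos:])
    while len(stack) > 1:
        # Unclosed opens: put the delimiter and its body back unchanged.
        body = ''.join(stack.pop())
        stack[-1].append('<<' + body)
    return ''.join(stack[0])


html = convert('<<LINK link;<<I italic text>>>>')
```

Because each closed tag is rendered into its parent's buffer, the inner
<I>...</I> is already HTML by the time the enclosing LINK is rendered, and
the rendered output is never scanned for delimiters again.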

>I hope there is an easy way that I simply have overlooked 8-) .
>Any suggestions are appreciated. Thank you in advance :) .
>
>Stefan

hopefully-helpfully y'rs,
Felix