Parsing of nested tags
Felix Thibault
felixt at dicksonstreet.com
Mon Mar 6 19:45:00 EST 2000
At 00:03 3/4/00 +0100, Stefan Schwarzer wrote:
>Hello :-)
>
>Some time ago I have written one of the zillion programs that read some
>kind of format file(s) and make HTML from them. The program is able to
>convert <<I italic text>> to <I>italic text</I> or <<LINK link;text>> to
><A HREF="link">text</A>.
>
>However, currently I can't convert <<LINK link;<<I italic text>>>> to
><A HREF="link"><I>italic text</I></A>. The relevant code is
>
>-----8<---------------------------------------------------------------
>
>######################################################################
># perform substitutions
># <<link url;url_text>>, url_text defaults to url
>link_pattern = re.compile( '(?si)<<link (.+?)(?:;(.*?))?>>' )
>
>def make_link( matchobj ):
>
> url, url_text = matchobj.groups()
> if not url_text: # use url as url_text by default
> url_text = url
> url, url_text = map( string.strip, [ url, url_text ] )
>
> return string.join( (
> html_format.link_format[ 0 ],
> url,
> html_format.link_format[ 1 ],
> url_text,
> html_format.link_format[ 2 ] ), '' )
>
># evaluate some formatting in the text to legal code
>def make_html( text ):
> # order matters, - conversion to links has to be come first
> text = re.sub( link_pattern, make_link, text )
> text = re.sub( r'<<(\S+)\s(.*?)>>', r'<\1>\2</\1>', text )
> text = re.sub( r'(?i)<PROG>(.*?)</PROG>', r'<EM>\1</EM>', text )
> text = re.sub( r'(?i)<FILE>(.*?)</FILE>', r'<EM>\1</EM>', text )
> text = re.sub( r'(?i)<OPT>(.*?)</OPT>', r'<STRONG>\1</STRONG>', text )
> return text
>
>-----8<---------------------------------------------------------------
>
>Now the question: Which is the best way to enable parsing of recursive
>parsing as mentioned in the example above?
>
>So far I have thought of two ways. One may be to extend the regular
>expression(s), but this is already cumbersome to read. The other
>possibility would be to scan the string and replace <<...>> occurences
>which don't contain <<, perhaps multiple times, until all patterns are
>substituted.
A while back I had read that you can't do arbitrarily deeply nested
stuff with just a re, and I came up with this (re-less) way:
-find where every opening and closing delimiter is in the text:
#a func like this is in TextTools, but I couldn't find it in the #standard
library...
def findall(text, substring):
"""findall(text, substring) -> slices of text substring occurs in.
"""
end, slices, size = 0, [], len(substring)
for piece in string.split(text, substring)[:-1]:
start = end+len(piece)
end = start+size
slices.append((start, end))
return slices
openslices, closeslices = findall(text, '<<'), findall(text, '>>')
-merge these two lists, and make a dictionary so you can tell if an
index means open a new tag or close an old one
def merge_n_sort(*klvtupes):
"""merge_n_sort(*klvtupes) -> sortedlist, dict
Take in a set of 2-ples made of a list of slices and
a delimiter and return a sorted list of all left
indices and a {left: delimiter} dictionary."""
indexdict = {}
for keylist, delimiter in klvtupes:
for left, right in keylist:
indexdict[left] = delimiter
indices = indexdict.keys()
indices.sort()
return indices, indexdict
indices, indexdict = merge_n_sort((openslices, '<<'),
(closeslices, '>>'))
-Then you just do something like:
stack, out = (), []
for i in range(len(indices)):
if indexdict[indices[i]] == '<<':
...handle opening the tag and
stack = tagname, stack #push...
else: #indexdict[indices[i]] == '>>':
try:
tagname, stack = stack #...pop
...handle closing the tag
except ValueError: #eg: (mismatched)parens)
pass
return string.join(out)
(nb: you have to re-write findall if you have escaped stuff
this is not-very-tested code.)
>I hope there is an easy way that I simply have overlooked 8-) .
>Any suggestions are appreciated. Thank you in advance :) .
>
>Stefan
>--
>http://www.python.org/mailman/listinfo/python-list
>
>
hopefully-helpfully y'rs,
Felix
More information about the Python-list
mailing list