Parsing markup.

Mon Nov 29 11:20:20 EST 2010

Hi MRAB,

I was trying to avoid regex because my poor old brain has trouble with it. I
have to admit, though, that line is slick!  I'll have to go through my regex
documentation to try and figure out what it actually means.

Thanks!

-----Original Message-----
From: python-list-bounces+joe=goldthwaites.com at python.org
[mailto:python-list-bounces+joe=goldthwaites.com at python.org] On Behalf Of
MRAB
Sent: Thursday, November 25, 2010 9:03 PM
To: python-list at python.org
Subject: Re: Parsing markup.

On 26/11/2010 03:28, Joe Goldthwaite wrote:
 > I'm attempting to parse some basic tagged markup.  The TinyMCE editor
 > returns a string that looks something like this:
 >
 > <p>This is a paragraph with <b>bold</b> and <i>italic</i> elements in
 > it</p><p>It can be made up of multiple lines separated by pagagraph
 > tags.</p>
 >
 > I'm trying to render the paragraph into a bitmapped image.  I need
 > to parse it out into the various paragraph and bold/italic pieces.
 > I'm not sure of the best way to approach it.  ElementTree and lxml seem
 > to want a fully formed page, not a small segment like this one.
 > When I tried to feed a line similar to the above to lxml I got an
 > error: "XMLSyntaxError: Extra content at the end of the document".
 >
I'd probably use a regex:

 >>> import re
 >>> text = ("<p>This is a paragraph with <b>bold</b> and <i>italic</i> "
 ...         "elements in it</p><p>It can be made up of multiple lines "
 ...         "separated by pagagraph tags.</p>")
 >>> re.findall(r"</?\w+>|[^<>]+", text)
['<p>', 'This is a paragraph with ', '<b>', 'bold', '</b>', ' and ',
 '<i>', 'italic', '</i>', ' elements in it', '</p>', '<p>',
 'It can be made up of multiple lines separated by pagagraph tags.', '</p>']
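
For anyone puzzling over the pattern r"</?\w+>|[^<>]+", here is a rough
sketch that spells it out with re.VERBOSE and then walks the resulting
tokens to group text by paragraph and active style. The names TOKEN and
styled_runs are made up for illustration and are not from the thread. It
assumes bare tags like <p>, <b> and <i>; a tag with attributes would not
match the first alternative, and entities such as &amp; are left undecoded.

```python
import re

# Same pattern as in the thread, written with re.VERBOSE so that each
# part can carry a comment. Whitespace in the pattern is ignored.
TOKEN = re.compile(r"""
    </?\w+>     # a tag: '<', an optional '/', one or more word chars, '>'
  |             # or
    [^<>]+      # a run of text: one or more chars that are not '<' or '>'
""", re.VERBOSE)


def styled_runs(markup):
    """Return a list of paragraphs; each is a list of (text, styles) pairs.

    Hypothetical helper: styles is a tuple of the inline tag names that
    were open when the text was seen, e.g. ('b',) or ().
    """
    paragraphs = []
    active = []  # inline tags currently open, in the order they were opened
    for token in TOKEN.findall(markup):
        if token == "<p>":
            paragraphs.append([])       # start a new paragraph
        elif token == "</p>":
            active = []                 # drop any styles left open
        elif token.startswith("</"):
            name = token[2:-1]
            if name in active:
                active.remove(name)     # close the matching inline tag
        elif token.startswith("<"):
            active.append(token[1:-1])  # open an inline tag such as 'b'
        else:
            if not paragraphs:          # text before any <p>
                paragraphs.append([])
            paragraphs[-1].append((token, tuple(active)))
    return paragraphs


if __name__ == "__main__":
    sample = "<p>This is <b>bold</b> and <i>italic</i></p><p>Two</p>"
    for number, runs in enumerate(styled_runs(sample), start=1):
        print(number, runs)
```

Each (text, styles) pair could then be handed to whatever draws the
bitmap, picking a font from the styles tuple. This is only one way to
consume the token list, not necessarily what MRAB had in mind.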