Parsing markup.
Stefan Behnel
stefan_ml at behnel.de
Mon Nov 29 11:34:10 EST 2010
Jon Clements, 26.11.2010 13:58:
> On Nov 26, 4:03 am, MRAB<pyt... at mrabarnett.plus.com> wrote:
>> On 26/11/2010 03:28, Joe Goldthwaite wrote:
>> > I’m attempting to parse some basic tagged markup. The output of the
>> > TinyMCE editor returns a string that looks something like this;
>> >
>> > <p>This is a paragraph with<b>bold</b> and<i>italic</i> elements in
>> > it</p><p>It can be made up of multiple lines separated by paragraph
>> > tags.</p>
>> >
>> > I’m trying to render the paragraph into a bitmapped image. I need
>> > to parse it out into the various paragraph and bold/italic pieces.
>> > I’m not sure of the best way to approach it. ElementTree and lxml seem
>> > to want a full formatted page, not a small segment like this one.
>> > When I tried to feed a line similar to the above to lxml I got an
>> > error; “XMLSyntaxError: Extra content at the end of the document”.
This exception indicates that the OP is using the XML parser. XML allows
only one root element per document, so the second <p> is reported as
extra content.
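For illustration, here is a minimal sketch of how the XML parser reacts to a fragment with two top-level elements, and how wrapping it in a single root avoids the error. The shortened sample text and the wrapper name "root" are my own assumptions, and the wrapping trick only helps if the fragment is otherwise well-formed XML, which editor output may not be.

```python
from lxml import etree

# Shortened stand-in for the editor output: two top-level <p> elements.
text = "<p>first</p><p>second</p>"

error = None
try:
    # The XML parser expects exactly one root element.
    etree.fromstring(text)
except etree.XMLSyntaxError as exc:
    # Same kind of failure as reported in the original post.
    error = str(exc)

# Wrapping the fragment in one root element (the name is arbitrary)
# turns it into a well-formed XML document.
wrapped = etree.fromstring("<root>" + text + "</root>")
tags = [child.tag for child in wrapped]
```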
> lxml works fine for me - have you tried:
>
> from lxml import html
> text = ("<p>This is a paragraph with<b>bold</b> and<i>italic</i> "
>         "elements in it</p><p>It can be made up of multiple lines "
>         "separated by paragraph tags.</p>")
> tree = html.fromstring(text)
> print(tree.findall('p'))
> # should print [<Element p at 2b7b458>,<Element p at 2b7b3e8>]
> # (the addresses will differ)
Yep, either use lxml.etree's HTML parser or lxml.html.
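A rough sketch of both options follows, together with one possible way to split each paragraph into styled text runs for rendering. The helper iter_runs() is hypothetical (it is not part of lxml), and it assumes that only <b> and <i> carry styling.

```python
from lxml import etree, html

text = ("<p>This is a paragraph with<b>bold</b> and<i>italic</i> "
        "elements in it</p><p>It can be made up of multiple lines "
        "separated by paragraph tags.</p>")

# Option 1: lxml.etree with its HTML parser. The parser adds the
# surrounding <html><body> elements, so search below the root.
root = etree.fromstring(text, etree.HTMLParser())
paragraphs = root.findall('.//p')

# Option 2: lxml.html, which can return the top-level fragments directly.
fragments = html.fragments_fromstring(text)


def iter_runs(element, styles=()):
    """Yield (text, styles) pairs in document order (hypothetical helper)."""
    if element.tag in ('b', 'i'):
        styles = styles + (element.tag,)
    if element.text:
        yield element.text, styles
    for child in element:
        for run in iter_runs(child, styles):
            yield run
        if child.tail:
            # Tail text follows the child's end tag, so it takes the
            # styles of the enclosing element, not those of the child.
            yield child.tail, styles


for p in paragraphs:
    print(list(iter_runs(p)))
```

Each paragraph then comes out as a flat list of text pieces, each tagged with the styles that apply to it, which should be easy to hand to whatever draws the image.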
Stefan