On Sun, 24 Aug 2008, Richard Baron Penman wrote:
I have a document with a format like this: <doc>text1<b>text2</b>text3<b>text4</b>text5</doc>
I want to extract 'text1text3text5' from <doc> but the text attribute returns just 'text1'. Here is an example:
    from lxml import html
    doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
[...]
From the example you can see I can get what I want by first dropping the subelements. Is there a better way to access this text? [...]
I only have 1.3.6 installed, so I don't have the HTML support, but I think you want to use the .tail of the b elements. With the XML API:

    from lxml.etree import fromstring
    doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
    b1, b2 = doc.getchildren()
    print(doc.text + b1.tail + b2.tail)

John
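
The same idea can be written so it does not depend on there being exactly two children. This is only a rough sketch: the helper name direct_text is made up for illustration, and the fallback import assumes the standard library's ElementTree is an acceptable stand-in when lxml is not installed (it uses the same .text/.tail model).

```python
try:
    from lxml.etree import fromstring  # the library discussed in this thread
except ImportError:
    # Assumption: stdlib ElementTree as a stand-in; same .text/.tail model.
    from xml.etree.ElementTree import fromstring


def direct_text(element):
    """Join the text that sits directly inside element, skipping child content.

    Hypothetical helper, not part of lxml. .text and .tail can be None,
    so each is replaced by an empty string before joining.
    """
    parts = [element.text or '']
    for child in element:
        # A child's tail is the text between its end tag and the next sibling.
        parts.append(child.tail or '')
    return ''.join(parts)


doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
print(direct_text(doc))  # text1text3text5
```

With lxml itself, ''.join(doc.xpath('text()')) should give the same result, since text() selects only the text nodes that are direct children of doc.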