24 Aug
2008
4:19 a.m.
Hello,

I have a document with a format like this:

    <doc>text1<b>text2</b>text3<b>text4</b>text5</doc>

I want to extract 'text1text3text5' from <doc>, but the text attribute returns just 'text1'. Here is an example:

    from lxml import html

    doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
    print(doc.text)            # 'text1'
    print(doc.tail)            # ''
    print(doc.text_content())  # 'text1text2text3text4text5'

    for child in doc:
        child.drop_tree()
    print(doc.text)            # 'text1text3text5'

From the example you can see I can get what I want by first dropping the subelements. Is there a better way to access this text?
regards, Richard
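
One possible alternative to dropping the subelements, offered as a minimal sketch rather than a definitive answer: in the ElementTree data model that lxml follows, the text after a child's closing tag is stored in that child's tail, so the element's own text can be assembled from doc.text plus the tail of each child without modifying the tree. The helper name own_text is hypothetical, and the sketch parses with etree.fromstring (with a standard-library fallback) instead of the lxml.html parser used in the post, which assumes well-formed markup.

```python
try:
    from lxml import etree  # the library used in the post
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in here,
    # since it shares the same .text/.tail model.
    import xml.etree.ElementTree as etree


def own_text(element):
    """Return the text directly inside element, skipping text inside its children.

    Hypothetical helper, not part of lxml.
    """
    parts = [element.text or ""]
    for child in element:
        # child.tail is the text that follows the child's closing tag
        parts.append(child.tail or "")
    return "".join(parts)


doc = etree.fromstring("<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>")
result = own_text(doc)
print(result)
```

With lxml specifically, the XPath expression doc.xpath('text()') should also select just the text nodes that are direct children of doc, so ''.join(doc.xpath('text()')) is another non-destructive option worth trying.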