Hello,
I have a document with a format like this:
<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>
I want to extract 'text1text3text5' from <doc>, but the .text attribute returns just 'text1'. Here is an example:
from lxml import html
doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
print doc.text # 'text1'
print doc.tail # None
print doc.text_content() # 'text1text2text3text4text5'
for child in doc:
    child.drop_tree()
print doc.text # 'text1text3text5'
From the example you can see I can get what I want by first dropping the subelements.
Is there a better way to access this text?
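For comparison, here is a rough sketch of a non-destructive alternative I was wondering about. It assumes that in lxml the text following a subelement is stored in that subelement's .tail, so the element's own text should be its .text plus the .tail of each direct child. The helper name own_text is made up for illustration and is not part of lxml. The XPath expression text() looks like it selects the same direct text nodes.

```python
from lxml import html


def own_text(element):
    """Join the text directly inside `element`, skipping subelement content.

    Hypothetical helper, not part of lxml: it reads element.text and the
    .tail of each direct child, and leaves the tree unmodified.
    """
    parts = [element.text or '']
    for child in element:
        # Text that follows a child's closing tag is stored in child.tail.
        parts.append(child.tail or '')
    return ''.join(parts)


doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')

print(own_text(doc))                  # text1text3text5
print(''.join(doc.xpath('text()')))   # text1text3text5
print(doc.text_content())             # text1text2text3text4text5 (tree unchanged)
```

Neither variant removes the <b> elements, so the document can still be used afterwards.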
Regards,
Richard