[lxml-dev] Text obscured by subelement
Hello,

I have a document with a format like this:

    <doc>text1<b>text2</b>text3<b>text4</b>text5</doc>

I want to extract 'text1text3text5' from <doc>, but the text attribute returns just 'text1'. Here is an example:

    from lxml import html
    doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
    print doc.text            # 'text1'
    print doc.tail            # ''
    print doc.text_content()  # 'text1text2text3text4text5'
    for child in doc:
        child.drop_tree()
    print doc.text            # 'text1text3text5'
From the example you can see I can get what I want by first dropping the subelements. Is there a better way to access this text?
regards, Richard
On Sun, 24 Aug 2008, Richard Baron Penman wrote:
> I have a document with a format like this:
> <doc>text1<b>text2</b>text3<b>text4</b>text5</doc>
>
> I want to extract 'text1text3text5' from <doc> but the text attribute returns just 'text1'. Here is an example:
>
> from lxml import html
> doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
[...]
> From the example you can see I can get what I want by first dropping the subelements. Is there a better way to access this text?
[...]
I only have 1.3.6 installed, so don't have the HTML support, but you want to use the .tail of the b elements I think. With the XML API:

    from lxml.etree import fromstring
    doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
    b1, b2 = doc.getchildren()
    print doc.text + b1.tail + b2.tail

John
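To make the .text/.tail split concrete, here is a minimal Python 3 sketch that lists where each piece of text in the example document is stored. It assumes either lxml or, as a fallback, the standard library's xml.etree.ElementTree, which uses the same .text/.tail model; the `pieces` name is only for illustration.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: stdlib fallback, which shares the .text/.tail model
    import xml.etree.ElementTree as etree

doc = etree.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')

# doc.text holds only the text before the first child element.
# Each child's .tail holds the text that follows that child's closing tag,
# up to the next sibling (or the parent's closing tag).
pieces = [(child.tag, child.text, child.tail) for child in doc]

print(repr(doc.text))
for tag, text, tail in pieces:
    print(tag, repr(text), repr(tail))
```

So 'text3' and 'text5' are not lost; they are attached to the two b elements as their tails rather than to doc itself.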
John J Lee <jjl@pobox.com> (JJL) wrote:
JJL> On Sun, 24 Aug 2008, Richard Baron Penman wrote:
JJL> > I have a document with a format like this:
JJL> > <doc>text1<b>text2</b>text3<b>text4</b>text5</doc>
JJL> >
JJL> > I want to extract 'text1text3text5' from <doc> but the text attribute returns just 'text1'. Here is an example:
JJL> >
JJL> > from lxml import html
JJL> > doc = html.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
JJL> [...]
JJL> > From the example you can see I can get what I want by first dropping the subelements. Is there a better way to access this text?
JJL> [...]

JJL> I only have 1.3.6 installed, so don't have the HTML support, but you want
JJL> to use the .tail of the b elements I think. With the XML API:

JJL> from lxml.etree import fromstring
JJL> doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
JJL> b1, b2 = doc.getchildren()
JJL> print doc.text + b1.tail + b2.tail

    print doc.text+''.join(c.tail for c in doc.getchildren())

--
Piet van Oostrum <piet@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org
Piet van Oostrum wrote:
> John J Lee <jjl@pobox.com> (JJL) wrote:
> JJL> doc = fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
> JJL> b1, b2 = doc.getchildren()
> JJL> print doc.text + b1.tail + b2.tail
>
> print doc.text+''.join(c.tail for c in doc.getchildren())

    print doc.text+''.join(c.tail for c in doc)

Stefan
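One caveat with the one-liners in this thread: .text and .tail are None when there is no text at that position, so the concatenation raises a TypeError on documents where, for example, a b element is immediately followed by another tag. A hedged Python 3 sketch of the same idea with that case guarded follows; `own_text` is a hypothetical helper name, not part of lxml, and the stdlib fallback import is an assumption for environments without lxml.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: stdlib fallback, which shares the .text/.tail model
    import xml.etree.ElementTree as etree


def own_text(element):
    """Hypothetical helper: join the text that sits directly inside
    `element`, skipping the content of its child elements."""
    # .text and .tail may be None, so substitute an empty string.
    parts = [element.text or '']
    parts.extend(child.tail or '' for child in element)
    return ''.join(parts)


doc = etree.fromstring('<doc>text1<b>text2</b>text3<b>text4</b>text5</doc>')
print(own_text(doc))  # text1text3text5

# No text directly inside <doc>: every .text/.tail involved is None.
bare = etree.fromstring('<doc><b>text2</b><b>text4</b></doc>')
print(repr(own_text(bare)))  # ''
```

Unlike the drop_tree() approach in the original question, this leaves the tree unmodified.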
participants (4): John J Lee, Piet van Oostrum, Richard Baron Penman, Stefan Behnel