[lxml-dev] Wrapping one Tag around existing text
Hello,

I was trying to find all the text elements in an HTML doc whose parents are divs and make it so their parents were p elements.

e.g. turn this:

    <div><h1>Title</h1>text<h2>stuff</h2></div>

into this:

    <div><h1>Title</h1><p>text</p><h2>stuff</h2></div>

So far I have:

    for text in (x for x in parsed.xpath('//div/text()') if len(x.strip())):
        # I have to put '' + in front because the builder doesn't recognize
        # text as a string type, and str() chokes on the string conversion
        # because of unicode.
        p = builder.P('' + text)

I just don't know how to insert the created p element into the div while deleting the text node. I tried

    text.getparent().insert(text.getparent().index(text), p)

but it says that the argument is supposed to be an _Element, not an _ElementStringResult.

Also, it says that the parent of the _ElementStringResult "text" is h1, and not div. Why is this? In order to get the div container (the true parent of "text") I had to do text.getparent().getparent().

Thanks
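The parent question above can be illustrated with a minimal sketch. It assumes a current lxml on Python 3 (where str() on an XPath text result works) and uses a slightly extended version of the example markup, with a hypothetical leading "lead" text added so that both cases show up. Each text result from xpath() carries is_text and is_tail flags, and getparent() returns the element that stores the text in its .text or .tail attribute, which for tail text is the preceding sibling rather than the enclosing div.

```python
import lxml.html

# Example markup from the message, plus an assumed leading text "lead"
doc = lxml.html.document_fromstring(
    "<html><body><div>lead<h1>Title</h1>text<h2>stuff</h2></div></body></html>")

texts = doc.xpath("//div/text()")

for t in texts:
    holder = t.getparent()  # element whose .text or .tail stores this string
    kind = "text" if t.is_text else "tail"
    print(repr(str(t)), "is the", kind, "of", holder.tag)

# "lead" is div.text, so its holder is the div itself.
# "text" is h1.tail, so its holder is h1; the div is one level further up.
enclosing = texts[1].getparent().getparent()
print(enclosing.tag)
```

This also explains the insert() error in the message: index() and insert() operate on elements, while the XPath result is a string object that merely remembers which element it came from.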
Hey guys,

I actually just finished it. For future reference, in case anybody needs it:

    # assuming E is lxml.html.builder, as in the first message
    from lxml.html import builder as E

    for text in (x for x in self.parsed.xpath('//div/text()') if len(x.strip())):
        p = E.P('' + text)
        if text.is_tail:
            # getparent() is the preceding sibling; its parent is the div
            textIndex = text.getparent().getparent().index(text.getparent()) + 1
            text.getparent().getparent().insert(textIndex, p)
            text.getparent().tail = None
        elif text.is_text:
            # getparent() is already the div here, so insert into it directly
            text.getparent().insert(0, p)
            text.getparent().text = None

Although I do have some issues. My main issue is the way that text is handled by lxml when parsing an HTML document. I can understand the .tail and .text attributes for XML, but it is my belief that lxml should handle text in HTML like an Element. Because the tail text of an Element in the context of HTML has nothing to do with that Element, I believe that lxml.html should handle text like an element, and that text should be included in the list of children.

e.g. for this:

    parsed = lxml.html.document_fromstring('<html><body><div>This is some text<a>that has links</a>which interrupts the text.<a>In</a> a couple <a>of</a> places.</div></body></html>')

list(parsed.cssselect('div')[0]) should return
    [<_TextElement>, <_Element>, <_TextElement>, <_Element>, <_TextElement>, <_Element>, <_TextElement>]

print str(parsed.cssselect('div')[0][0]) should return
    "This is some text"

parsed.cssselect('div')[0][1] is the a container with the text "that has links"

parsed.xpath('//text()') should return
    [<_TextElement>, <_TextElement>, <_TextElement>, <_TextElement>]

Also, _Element.index should work properly with text:
    parsed.cssselect('div')[0].index(parsed.xpath('//text()')[1])
would return 2.

I believe this because it is more intuitive. In my example the text "a couple" should have only a sibling relationship with the A container, but lxml.html designs it so that the "parent" of the text is the A container and not the true parent, the DIV (from which it would gain all of its attributes, and none from the A container).
Why would you look in the A container when the text isn't even in it?

Although I am sure that there is something preventing this from happening, I would appreciate it if it were considered.

--
Kyle Hanson

On Mon, Jun 21, 2010 at 2:46 PM, Kyle Hanson <hanooter@gmail.com> wrote:
[...]
Kyle Hanson, 22.06.2010 05:23:
parsed = lxml.html.document_fromstring('<html><body><div>This is some text<a>that has links</a>which interrupts the text.<a>In</a> a couple <a>of</a> places.</div></body></html>') [...] In my example the text "a couple" should have only a sibling relationship with the A container, but lxml.html designs it so that the "parent" of the text is the A container and not the true parent
Well, it *is* the 'true parent' in the tree model. You'd be rather surprised if the text wasn't accessible on the Element that getparent() returned. There isn't a sibling relationship for text, and I certainly don't want to add that as well.

So, changing the parent would mean that you'd have to search all children of your proposed parent in order to find the text (and if you're unlucky, you'd find the text more than once that way).

In the current implementation, all you have to do to get to the Element that holds the text is to call getparent(). And to get to the surrounding container of tail text, you can call getparent() twice. That's a lot simpler than an unsafe subtree search, or an additional sibling API just for tail text content.

Stefan
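The "getparent() once or twice" rule described above can be sketched as a small helper. The name container_of is hypothetical and not part of lxml, and a current lxml on Python 3 is assumed. The last lines go beyond what the thread mentions: the XPath expression node() selects child nodes of every kind in document order, which gives a flat view of text and elements without changing the tree model.

```python
import lxml.html

doc = lxml.html.document_fromstring(
    "<html><body><div>This is some text<a>that has links</a>"
    "which interrupts the text.<a>In</a> a couple <a>of</a> places."
    "</div></body></html>")


def container_of(text):
    """Return the element that encloses an XPath text result (hypothetical)."""
    holder = text.getparent()  # element whose .text or .tail holds the string
    # Tail text sits *after* its holder, so the container is one level up.
    return holder.getparent() if text.is_tail else holder


div = doc.xpath("//div")[0]
texts = div.xpath("text()")

couple = texts[2]                       # the " a couple " text
print(couple.getparent().tag)           # the element that stores it as .tail
print(container_of(couple).tag)         # the surrounding container

# Mixed, document-ordered view of the div's children: text results are
# str subclasses, everything else here is an element.
kinds = ["text" if isinstance(n, str) else n.tag for n in div.xpath("node()")]
print(kinds)
```

The helper never searches a subtree: the is_tail flag alone decides whether one or two getparent() calls are needed.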
participants (2): Kyle Hanson, Stefan Behnel