[lxml-dev] html parsing, .text
When I try to get the text of a tag in HTML, I get it only if no other tag comes before that text. Here is code that demonstrates the problem:

    import lxml.html
    html = """<strong><strong>another text</strong><br>some text</strong>"""
    doc = lxml.html.fromstring(html)
    print(doc.text_content())  # "some text" is here

But when I try to get the text of this tag:

    print(doc.text)  # returns None, but the tag does have text: "some text"
    for a in doc:
        a.text  # no subtag has the text "some text"

It only works if the text comes before the tags:

    html = """<strong>some text<strong>another text</strong><br></strong>"""

But I need to parse web pages with text after tags. Can you help me?

Versions:
    lxml.etree: (2, 2, 6, 0)
    libxml used: (2, 7, 7)
    libxml compiled: (2, 7, 6)
    libxslt used: (1, 1, 26)
    libxslt compiled: (1, 1, 26)
You also need to use the "tail" property. "text" holds the text inside the element, up to its first child; "tail" holds the text that follows the element's closing tag.

    for a in doc:
        print(a.text, a.tail)

Cheers,

On Fri, Jan 28, 2011 at 3:45 PM, Miro Mintal <miromintal@gmail.com> wrote:
When I try to get the text of a tag in HTML, I get it only if no other tag comes before that text.
-- Joaquin Cuenca Abela
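The text/tail split described in this reply can be seen on the markup from the question. This is only a minimal sketch, under two assumptions that are not in the thread: it uses the standard library's xml.etree.ElementTree (which follows the same text/tail model as lxml) so that it runs without extra packages, and the <br> is written as <br/> so the sample is well-formed XML. The helper name direct_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Markup from the question; <br/> is self-closed here (an assumption) so that
# the strict XML parser in the standard library accepts it.
doc = ET.fromstring(
    "<strong><strong>another text</strong><br/>some text</strong>"
)


def direct_text(element):
    """Join the text that sits directly inside element.

    That is element.text (the text before the first child) plus the .tail
    of every child (the text after each child's closing tag). The children's
    own content is deliberately skipped.
    """
    parts = [element.text or ""]
    for child in element:
        parts.append(child.tail or "")
    return "".join(parts)


# The outer element has no text before its first child.
print(repr(doc.text))

# Each child carries the text that follows it in its .tail attribute.
for child in doc:
    print(child.tag, repr(child.text), repr(child.tail))

# "some text" is recovered from the tail of the <br/> element.
print(direct_text(doc))
```

With lxml.html the same attributes should be available on the elements returned by fromstring, so the helper would probably carry over unchanged.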
On Fri, 28 Jan 2011, Joaquin Cuenca Abela wrote:

+--
| you need to use also the "tail" property. "text" is for the
| text inside the element, tail is for the text after the element
| is closed.
+--

In the lxml/ElementTree world, the way mixed content works is, in my experience, the hardest thing to understand. Perhaps a picture might help:

http://www.nmt.edu/tcc/help/pubs/pylxml/etree-view.html

Because I came from a DOM background, when I first saw how the .tail attribute works, I completely rejected the entire framework because I thought it was ugly.

What changed my mind was performance. With Python's minidom it took one program about 35 seconds to read a half-megabyte XML file; lxml read that same file in 600 milliseconds.

Once I started actually using lxml, I found that handling mixed content is not that bad at all. Appended below my .signature is a little function that I use everywhere to append text as the child of an element without having to worry about where it goes.

Forgive me for promoting my own work, but the document containing the above link describes how to use lxml for reading, writing, and updating XML. It also includes an annotated version of Fredrik Lundh's builder.py module, which makes code to generate XML much more straightforward and compact.

http://www.nmt.edu/tcc/help/pubs/pylxml/etree-view.html

Best regards,
John Shipman (john@nmt.edu), Applications Specialist, NM Tech Computer Center,
Speare 119, Socorro, NM 87801, (575) 835-5735, http://www.nmt.edu/~john
``Let's go outside and commiserate with nature.'' --Dave Farber
================================================================

def addText ( node, s ):
    '''Add text content to an element.

      [ (node is an Element) and (s is a string) ->
          if node has any children ->
            last child's .tail  :=  last child's tail + s
          else ->
            node.text  :=  node.text + s ]
    '''
    if len(node) == 0:
        node.text = (node.text or "") + s
    else:
        lastChild = node[-1]
        lastChild.tail = (lastChild.tail or "") + s
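As a rough illustration of how a helper like the one above might be used to build mixed content, here is a small sketch. It assumes the standard library's xml.etree.ElementTree in place of lxml (both follow the same text/tail model), and the helper is restated under the hypothetical name add_text so that the example is self-contained. The element names and strings are made up for the demonstration.

```python
import xml.etree.ElementTree as ET


def add_text(node, s):
    """Append the string s to the end of node's content.

    If node has no children, the string extends node.text; otherwise it
    extends the .tail of the last child, which is where text following
    that child is stored.
    """
    if len(node) == 0:
        node.text = (node.text or "") + s
    else:
        last_child = node[-1]
        last_child.tail = (last_child.tail or "") + s


# Build <p>Hello, <b>mixed</b> content!</p> piece by piece.
para = ET.Element("p")
add_text(para, "Hello, ")        # no children yet: goes into para.text
bold = ET.SubElement(para, "b")
bold.text = "mixed"
add_text(para, " content")       # a child exists: goes into bold.tail
add_text(para, "!")              # appended to the same tail

result = ET.tostring(para, encoding="unicode")
print(result)
```

The caller never has to decide between .text and .tail; the helper picks the right slot based on whether the element already has children.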
participants (3)
- Joaquin Cuenca Abela
- John W. Shipman
- Miro Mintal