
On Tuesday, 5 November 2013, at around 18:02, Holger Joukl wrote:
Hi,
Hello Holger,
"lxml" <lxml-bounces@lxml.de> wrote on 05.11.2013 17:00:48:
Problem 2: I also want to extract "Some texttext" just after the span with the Title class. Then I have to trigger an event, in a SAX-like way. How could I do that if I am already working inside TREE.xpath('/html/body/table')?
Again, multiple ways to do it.
I was thinking about this one:
Or iterate over the children until you find the right <br> and read its tail.
But what is the syntax with the example I gave you? (Sorry, but I understand better now thanks to your example with the exact syntax.)
I already deleted your original mail, but this might get you going:
    >>> from lxml import etree
    >>> tree = etree.fromstring('<root><mixed>mixed text<x>x text</x>x tail</mixed></root>')
    >>> print etree.tostring(tree, pretty_print=True)
    <root>
      <mixed>mixed text<x>x text</x>x tail</mixed>
    </root>

    >>> for elem in tree.iter():
    ...     if elem.tail:  # use appropriate condition here to select your elements
    ...         print elem, elem.tail
    ...
    <Element x at 0x3a2d78> x tail
Thanks, then I could trigger an event with elem.tag, but I started the script like this:

    file = "file.html"
    parser = etree.HTMLParser()
    tree = etree.parse(file, parser)

And I started to navigate inside the file with xpath:

    for document in tree.xpath('/html/body/table'):
        [...]
        for data in document.xpath("./tr/td"):
            ... here are my data after <br><br>

Then I am looking for something like data.iter() on the xpath result, but it does not seem to be possible. How do I get a tree (not via fromstring) starting from where I am during the parsing?

-- 
Alexandre

Here is a copy of the previous mail, if needed:

Date: Tue, 05 Nov 2013 13:50:10 +0100
From: Stefan Behnel <stefan_ml@behnel.de>
To: lxml mailing list <lxml@lxml.de>
Subject: Re: [lxml] Xpath, span control and iterate inside xpath result.
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0

Alexandre Delanoë, 05.11.2013 13:19:
Hello, and many thanks for the lxml module!
I am discovering its capabilities, but I am struggling to finish my first parser because of two issues.
First, I have a valid file.html to parse; here is an example:
    <html>
      <body>
        <tbody>
          <tr>
            <td>
              <table id="tbHeaderImages">
                <span class="Title">
                  BLO BLO
                  <span style="color:red; font-weight:bold">BLU BLU</span>
                  BLA BLA
                </span>
                <br>
                <br>
                Some texttext
                <br>
                ...
              </table>
        </tbody>
      </body>
    </html>
I have many documents, each inside a table:

    for document in TREE.xpath('/html/body/table'):

Then, I want to extract each title:

    title = document.xpath("./tr/td/span[@class = 'Title']/text()")

Problem 1: I get "BLO BLO" and "BLA BLA", whereas I would like "BLO BLO", "BLU BLU" and "BLA BLA".
You could try either of the following:

    title = document.xpath("./tr/td/span[@class = 'Title']//text()")
    title = document.xpath("string(./tr/td/span[@class = 'Title'])")

or even

    text = etree.tostring(
        document.xpath("./tr/td/span[@class = 'Title']")[0],
        method='text')

depending on what exactly you want as a result.
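
To make the difference between these three variants concrete, here is a minimal sketch that runs them against a simplified stand-in for the example document. The HTML string and the names HTML, title_path, parts, joined and serialised are assumptions made for this illustration, not part of the original script, and the code is written for Python 3.

```python
from lxml import etree

# Simplified stand-in for file.html (an assumption for this sketch).
HTML = (
    '<html><body><table><tr><td>'
    '<span class="Title">BLO BLO '
    '<span style="color:red; font-weight:bold">BLU BLU</span>'
    ' BLA BLA</span>'
    '<br><br>Some texttext<br>'
    '</td></tr></table></body></html>'
)

tree = etree.fromstring(HTML, etree.HTMLParser())
document = tree.xpath('/html/body/table')[0]
title_path = "./tr/td/span[@class = 'Title']"

# 1) '//text()' returns every text node below the span, as a list.
parts = document.xpath(title_path + "//text()")

# 2) string() concatenates those text nodes into a single string.
joined = document.xpath("string(" + title_path + ")")

# 3) Serialising the span with method='text' yields the same text.
#    encoding='unicode' returns str instead of bytes, and
#    with_tail=False leaves out text that follows the closing </span>.
span = document.xpath(title_path)[0]
serialised = etree.tostring(span, method='text', encoding='unicode',
                            with_tail=False)

print(parts)
print(joined)
print(serialised)
```

The first variant is the one to pick if the pieces are needed separately. The other two are more convenient when only the full title string matters.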
Problem 2: I also want to extract "Some texttext" just after the span with the Title class. Then I have to trigger an event, in a SAX-like way. How could I do that if I am already working inside TREE.xpath('/html/body/table')?
Again, multiple ways to do it. You could delete the <span> tag and then serialise the table as text, for example. Or iterate over the children until you find the right <br> and read its tail.

Stefan

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
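
As an illustration of the "iterate over the children and read the tail" suggestion, here is a minimal sketch. It relies on the fact that the items returned by xpath() for an element path are ordinary elements, so they can be looped over and support .iter() just like a tree built with fromstring(). The HTML string and the helper name text_after_title are hypothetical, chosen for this example only, and the code is written for Python 3.

```python
from lxml import etree

# Simplified stand-in for file.html (an assumption for this sketch).
HTML = (
    '<html><body><table><tr><td>'
    '<span class="Title">BLO BLO '
    '<span style="color:red; font-weight:bold">BLU BLU</span>'
    ' BLA BLA</span>'
    '<br><br>Some texttext<br>'
    '</td></tr></table></body></html>'
)

tree = etree.fromstring(HTML, etree.HTMLParser())


def text_after_title(cell):
    """Return the first non-blank tail of a <br> following the Title span.

    Hypothetical helper: returns None if no such text is found.
    """
    seen_title = False
    for child in cell:  # direct children of the <td>, in document order
        if child.tag == 'span' and child.get('class') == 'Title':
            seen_title = True
        elif (seen_title and child.tag == 'br'
                and child.tail and child.tail.strip()):
            return child.tail.strip()
    return None


results = []
for document in tree.xpath('/html/body/table'):
    for cell in document.xpath('./tr/td'):
        results.append(text_after_title(cell))
        # An element from an xpath() result also supports .iter():
        tags = [elem.tag for elem in cell.iter()]

print(results)
print(tags)
```

The place where the helper returns a value is also where a SAX-like callback could be invoked, if an event has to be triggered for each match.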