Xpath, span control and iterate inside xpath result.

Hello, and many thanks for the lxml module!

I am discovering its capabilities, but I am struggling to finish my first parser because of two issues.

First, I have a valid file.html to parse; here is an example:

    <html>
    <body>
    <tbody>
    <tr>
    <td>
    <table id="tbHeaderImages">
    <span class="Title">
    BLO BLO <span style="color:red; font-weight:bold">BLU BLU</span> BLA BLA
    </span>
    <br>
    <br>
    Some texttext
    <br>
    ...
    </table>
    </tbody>
    </body>
    </html>

I have many documents, each inside a table:

    for document in TREE.xpath('/html/body/table'):

Then I want to extract each title:

    title = document.xpath("./tr/td/span[@class = 'Title']/text()")

Problem 1: I get "BLO BLO" and "BLA BLA", whereas I would like "BLO BLO", "BLU BLU" and "BLA BLA". I have heard that I should get the text value from the first node, then get the second node (not its text value) and use the .text attribute to replace the text. But I do not really understand how to do that. Could you please give me an example?

Problem 2: I also want to extract "Some texttext" just after the Title span. Then I have to trigger an event, in a SAX-like manner. If you agree, how could I do such a trick if I am already working inside TREE.xpath('/html/body/table')?

Many thanks for your help,
--
Alexandre Delanoë

Alexandre Delanoë, 05.11.2013 13:19:
> I have many documents, each inside a table:
>     for document in TREE.xpath('/html/body/table'):
> Then I want to extract each title:
>     title = document.xpath("./tr/td/span[@class = 'Title']/text()")
> Problem 1: I get "BLO BLO" and "BLA BLA", whereas I would like "BLO BLO", "BLU BLU" and "BLA BLA".

You could try either of the following:

    title = document.xpath("./tr/td/span[@class = 'Title']//text()")

    title = document.xpath("string(./tr/td/span[@class = 'Title'])")

or even

    text = etree.tostring(
        document.xpath("./tr/td/span[@class = 'Title']")[0],
        method='text')

depending on what exactly you want as a result.
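
To make the difference between these variants concrete, here is a minimal sketch in Python 3. The markup is a hypothetical, well-formed stand-in for one "document" table (the HTML in the original post would be restructured by the HTML parser), and the variable names are only illustrative.

```python
from lxml import etree

# Hypothetical, well-formed stand-in for one "document" table.
table = etree.fromstring(
    '<table><tr><td>'
    '<span class="Title">BLO BLO '
    '<span style="color:red">BLU BLU</span>'
    ' BLA BLA</span>'
    '<br/><br/>Some texttext<br/>'
    '</td></tr></table>'
)

span_path = "./tr/td/span[@class = 'Title']"

# /text() selects only the text nodes that are direct children of the span.
direct = table.xpath(span_path + "/text()")

# //text() selects the text nodes of the span and of all its descendants.
all_text = table.xpath(span_path + "//text()")

# string() concatenates all descendant text into a single string.
joined = table.xpath("string(" + span_path + ")")

# tostring(method="text") serialises the element's text content;
# with_tail=False leaves out any text that follows the closing tag.
serialised = etree.tostring(
    table.xpath(span_path)[0],
    method="text", encoding="unicode", with_tail=False,
)

print(direct)      # list without the nested "BLU BLU"
print(all_text)    # list of three strings
print(joined)      # one string
print(serialised)  # one string
```

The first two calls return lists of strings, the last two a single string, which is presumably what "depending on what exactly you want" refers to.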

> Problem 2: I also want to extract "Some texttext" just after the Title span. Then I have to trigger an event, in a SAX-like manner. If you agree, how could I do such a trick if I am already working inside TREE.xpath('/html/body/table')?

Again, multiple ways to do it. You could delete the <span> tag and then serialise the table as text, for example. Or iterate over the children until you find the right <br> and read its tail.

Stefan
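
The second suggestion (walk the children and read the tail of the right <br>) could look roughly like the sketch below, written for Python 3. The cell markup is a simplified, hypothetical stand-in for the posted example, and text_after_title is an illustrative helper name, not something lxml provides.

```python
from lxml import etree

# Simplified, hypothetical stand-in for the <td> that holds one document.
td = etree.fromstring(
    '<td><span class="Title">BLO BLO</span>'
    '<br/><br/>Some texttext<br/>more</td>'
)


def text_after_title(cell):
    """Return the first non-blank tail of a <br> that follows the Title span."""
    seen_title = False
    for child in cell:  # iterating an element yields its direct children
        if child.tag == "span" and child.get("class") == "Title":
            seen_title = True
        elif seen_title and child.tag == "br" and child.tail and child.tail.strip():
            # The text that follows an element's closing tag is its .tail.
            return child.tail.strip()
    return None


body_text = text_after_title(td)
print(body_text)
```

The first <br> has no tail (it is followed directly by another element), so the loop skips it and returns the tail of the second one.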

On Tuesday, 5 November 2013, around 13:50, Stefan Behnel wrote:
> You could try either of the following:
> title = document.xpath("./tr/td/span[@class = 'Title']//text()")

Exactly what I was looking for.

> title = document.xpath("string(./tr/td/span[@class = 'Title'])")

Works fine (I just have to add .encode('UTF-8') at the end).

> or even
> text = etree.tostring(
>     document.xpath("./tr/td/span[@class = 'Title']")[0],
>     method='text')

Many thanks for all these solutions. I now understand better how it works.

> depending on what exactly you want as a result.

Sure!

> > Problem 2: I also want to extract "Some texttext" just after the Title span. Then I have to trigger an event, in a SAX-like manner. If you agree, how could I do such a trick if I am already working inside TREE.xpath('/html/body/table')?
> Again, multiple ways to do it.

I was thinking about this one:

> Or iterate over the children until you find the right <br> and read its tail.

But what is the syntax, using the example I gave you? (Sorry, but I understand better now thanks to your examples with the exact syntax.)

Many thanks for your help,
--
Alexandre Delanoë

Hi,

"lxml" <lxml-bounces@lxml.de> wrote on 05.11.2013 17:00:48:
> > > Problem 2: I also want to extract "Some texttext" just after the Title span. Then I have to trigger an event, in a SAX-like manner. If you agree, how could I do such a trick if I am already working inside the TREE.xpath('/html/body/table')?
> > Again, multiple ways to do it.
> I was thinking about this one:
> > Or iterate over the children until you find the right <br> and read its tail.
> But what is the syntax, using the example I gave you? (Sorry, but I understand better now thanks to your examples with the exact syntax.)

I have already deleted your original mail, but this might get you going:

>>> tree = etree.fromstring('<root><mixed>mixed text<x>x text</x>x tail</mixed></root>')
>>> print etree.tostring(tree, pretty_print=True)
<root>
  <mixed>mixed text<x>x text</x>x tail</mixed>
</root>

>>> for elem in tree.iter():
...     if elem.tail:  # use appropriate condition here to select your elements
...         print elem, elem.tail
...
<Element x at 0x3a2d78> x tail

Holger

On Tuesday, 5 November 2013, around 18:02, Holger Joukl wrote:
> Hi,

Hello Holger,

> "lxml" <lxml-bounces@lxml.de> wrote on 05.11.2013 17:00:48:
> > > > Problem 2: I also want to extract "Some texttext" just after the Title span. Then I have to trigger an event, in a SAX-like manner. If you agree, how could I do such a trick if I am already working inside the TREE.xpath('/html/body/table')?
> > > Again, multiple ways to do it.
> > I was thinking about this one:
> > > Or iterate over the children until you find the right <br> and read its tail.
> > But what is the syntax, using the example I gave you? (Sorry, but I understand better now thanks to your examples with the exact syntax.)
> I have already deleted your original mail, but this might get you going:
> >>> tree = etree.fromstring('<root><mixed>mixed text<x>x text</x>x tail</mixed></root>')
> >>> print etree.tostring(tree, pretty_print=True)
> <root>
>   <mixed>mixed text<x>x text</x>x tail</mixed>
> </root>
> >>> for elem in tree.iter():
> ...     if elem.tail:  # use appropriate condition here to select your elements
> ...         print elem, elem.tail
> ...
> <Element x at 0x3a2d78> x tail

Thanks, then I could trigger an event with elem.tag. But I started the script like this:

    file = "file.html"
    parser = etree.HTMLParser()
    tree = etree.parse(file, parser)

And I started to navigate inside the file with XPath:

    for document in tree.xpath('/html/body/table'):
        [...]
        for data in document.xpath("./tr/td"):
            ... here are my data after <br><br>

Then I am looking for something like data.iter() on the XPath result, but it is not possible.

How can I get a tree (not via fromstring) starting from where I am during the parsing?

--
Alexandre

Hi,

> then I could trigger an event with elem.tag. But I started the script like this:
>     file = "file.html"
>     parser = etree.HTMLParser()
>     tree = etree.parse(file, parser)
> And I started to navigate inside the file with XPath:
>     for document in tree.xpath('/html/body/table'):
>         [...]
>         for data in document.xpath("./tr/td"):
>             ... here are my data after <br><br>
> Then I am looking for something like data.iter() on the XPath result, but it is not possible.
> How can I get a tree (not via fromstring) starting from where I am during the parsing?

Not sure I understand. After parsing you have a tree. If you want to control events during parsing, while building the tree, you should probably have a look at
http://lxml.de/parsing.html#iterparse-and-iterwalk

If you want to iterate through XPath results, you'll need to take care of what your XPath results actually are (please look at the lxml XPath docs): etree.Element results have an iter() method themselves:

>>> tree = etree.fromstring('<root><sub><subsub>bla</subsub></sub></root>')
>>> tree.xpath('//subsub')
[<Element subsub at 0x7fb7a43c1190>]
>>> tree.xpath('//subsub')[0]
<Element subsub at 0x7fb7a43c1190>
>>> tree.xpath('//subsub')[0].iter()
<lxml.etree.ElementDepthFirstIterator object at 0x7fb7a43c11e0>

While string results do not:

>>> tree.xpath('//subsub/text()')
['bla']
>>> tree.xpath('//subsub/text()')[0].iter()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: '_ElementStringResult' object has no attribute 'iter'

Caution: strings are iterable themselves, though:

>>> iter(tree.xpath('//subsub/text()')[0])
<iterator object at 0x7fb7a43bb390>
>>> list(iter(tree.xpath('//subsub/text()')[0]))
['b', 'l', 'a']

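
One way to handle both kinds of result in the same loop is sketched below for Python 3 (where the string results are str subclasses rather than the Python 2 class in the traceback above). It checks the result with etree.iselement() and, for string results, steps back to the owning element with getparent(), which lxml's "smart string" results provide. The variable names are only illustrative.

```python
from lxml import etree

tree = etree.fromstring('<root><sub><subsub>bla</subsub></sub></root>')

# Element results: each one can be iterated as a subtree of its own.
elements = tree.xpath('//sub')
tags = [elem.tag for elem in elements[0].iter()]  # includes <sub> itself

# String results: no iter(), but lxml remembers where the text came from.
strings = tree.xpath('//subsub/text()')
first = strings[0]
owner = first.getparent()  # the element that holds this text

for result in elements + strings:
    if etree.iselement(result):
        print("element:", result.tag)
    else:
        print("string:", str(result), "from", result.getparent().tag)
```

From owner you are back on an element, so owner.iter(), owner.tail and further owner.xpath(...) calls are all available again.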
Holger
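
For the event-driven route mentioned in this reply (iterparse), a minimal Python 3 sketch might look as follows. The XML snippet and the tag names doc, title and body are hypothetical; for a real HTML file the input and options would differ.

```python
import io

from lxml import etree

# Hypothetical input; the tag names are made up for illustration.
xml = b'<root><doc><title>BLO BLO</title><body>Some texttext</body></doc></root>'

events = []
titles = []
for event, elem in etree.iterparse(io.BytesIO(xml), events=("start", "end")):
    events.append((event, elem.tag))
    # Text content is only guaranteed to be complete on the "end" event.
    if event == "end" and elem.tag == "title":
        titles.append(elem.text)
    # Once a whole document has been handled, its subtree can be released.
    if event == "end" and elem.tag == "doc":
        elem.clear()

print(events)
print(titles)
```

This gives SAX-like callbacks ("start" and "end" per element) while still handing you real elements, so .text, .tail and XPath remain usable inside the loop.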

Hello all,

again many thanks for your replies and for your "Verständnis": even though I had carefully read the tutorials, there was something I did not get. With your answers, I have understood the syntax for iterating over XPath results:

On Tuesday, 5 November 2013, around 21:11, jholg@gmx.de wrote:
> >>> tree.xpath('//subsub')[0].iter()
> <lxml.etree.ElementDepthFirstIterator object at 0x7fb7a43c11e0>

With that I have solved my issues, thanks to all of you. I am a very happy new user of lxml.etree...

--
Alexandre Delanoë

participants (4): Alexandre Delanoë, Holger Joukl, jholg@gmx.de, Stefan Behnel