
Alexandre Delanoë, 05.11.2013 13:19:
Hello, and many thanks for the lxml module!
I am discovering its capabilities, but I am struggling to finish my first parser because of two issues.
First, I have a valid file.html to parse. Here is an example:
    <html>
      <body>
        <tbody>
          <tr>
            <td>
              <table id="tbHeaderImages">
                <span class="Title">
                  BLO BLO
                  <span style="color:red; font-weight:bold">BLU BLU</span>
                  BLA BLA
                </span>
                <br>
                <br>
                Some texttext
                <br>
                ...
              </table>
        </tbody>
      </body>
    </html>
I have many documents, each inside a table:

    for document in TREE.xpath('/html/body/table'):

Then, I want to extract each title:

    title = document.xpath("./tr/td/span[@class = 'Title']/text()")

Problem 1: I get "BLO BLO" and "BLA BLA" whereas I would like "BLO BLO" and "BLU BLU" and "BLA BLA".
You could try either of the following:

    title = document.xpath("./tr/td/span[@class = 'Title']//text()")
    title = document.xpath("string(./tr/td/span[@class = 'Title'])")

or even

    text = etree.tostring(
        document.xpath("./tr/td/span[@class = 'Title']")[0], method='text')

depending on what exactly you want as a result.
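To make the difference between these variants concrete, here is a minimal sketch. The SAMPLE markup is a simplified, well-formed stand-in for the real file (an assumption, not the poster's actual document), and it is parsed as XML only to keep the example self-contained. The `encoding` and `with_tail` arguments to `tostring()` are additions to the call shown above, used so that the result is a plain string without any text following the span.

```python
from lxml import etree

# Hypothetical, simplified stand-in for one "document" table.
SAMPLE = (
    '<table><tr><td>'
    '<span class="Title">BLO BLO '
    '<span style="color:red; font-weight:bold">BLU BLU</span>'
    ' BLA BLA</span>'
    '<br/><br/>Some texttext<br/>'
    '</td></tr></table>'
)

document = etree.fromstring(SAMPLE)
path = "./tr/td/span[@class = 'Title']"

# /text() selects only the text nodes that are direct children of the span.
direct_text = document.xpath(path + "/text()")

# //text() also descends into nested elements such as the inner span.
all_text = document.xpath(path + "//text()")

# string() concatenates all descendant text into a single string.
as_string = document.xpath("string(" + path + ")")

# Text serialisation of the element itself, excluding its tail.
serialised = etree.tostring(
    document.xpath(path)[0],
    method="text", encoding="unicode", with_tail=False)

print(direct_text)
print(all_text)
print(as_string)
print(serialised)
```

The two list results keep the pieces separate, so whitespace can be stripped per item; `string()` and the text serialisation both return one joined string.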
Problem 2: I also want to extract "Some texttext", which comes just after the span with class "Title". For that, it seems I would have to trigger an event, as in a SAX-like method. Could you tell me how I could do such a trick while I am already working inside TREE.xpath('/html/body/table')?
Again, multiple ways to do it. You could delete the <span> tag and then serialise the table as text, for example. Or iterate over the children until you find the right <br> and read its tail.

Stefan
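Both suggestions for Problem 2 might look roughly like the sketch below. The SAMPLE markup and the two helper names (`text_after_title`, `text_without_title`) are hypothetical and only serve the illustration. "The right <br>" is assumed here to be the first <br> after the title span that carries non-blank tail text; the real file may need a different test.

```python
import copy
from lxml import etree

# Hypothetical, simplified stand-in for one "document" table.
SAMPLE = (
    '<table><tr><td>'
    '<span class="Title">BLO BLO '
    '<span style="color:red; font-weight:bold">BLU BLU</span>'
    ' BLA BLA</span>'
    '<br/><br/>Some texttext<br/>'
    '</td></tr></table>'
)


def text_after_title(document):
    """Walk the cell's children and return the tail of the first <br>
    after the title span whose tail is not blank."""
    cell = document.find("./tr/td")
    if cell is None:
        return None
    seen_title = False
    for child in cell:
        if child.tag == "span" and child.get("class") == "Title":
            seen_title = True
        elif seen_title and child.tag == "br":
            if child.tail and child.tail.strip():
                return child.tail.strip()
    return None


def text_without_title(document):
    """Remove the title span from a copy, then serialise the rest as text."""
    clone = copy.deepcopy(document)  # leave the caller's tree untouched
    for span in clone.xpath("./tr/td/span[@class = 'Title']"):
        # Note: in lxml, remove() also discards the span's tail text, if any.
        span.getparent().remove(span)
    return etree.tostring(clone, method="text", encoding="unicode").strip()


document = etree.fromstring(SAMPLE)
print(text_after_title(document))
print(text_without_title(document))
```

The first helper returns only the one text fragment, while the second returns everything left in the table once the title is gone, so they differ as soon as the table holds more text than in this sample.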