Hello all,

Suppose this example:

from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(file, parser)
docs = tree.xpath('/html/body/table')

And consider the structure of each doc:

<tr> <td> <span class="DocHeader">NOT WANTED DATA</span>
TEXT1
<p xmlns:scripts="urn:scripts.this" </p> TEXT2
TEXT3 <font color="red" xmlns:scripts="urn:scripts.this">RED</font> TEXT4. </td> </tr>

for doc in docs:
    print(doc.xpath("./tr/td/text()"))

Result:
TEXT1 TEXT2 TEXT3 TEXT4

How can I also get the text inside font, like this:
TEXT1 TEXT2 TEXT3 RED TEXT4

I am not satisfied with the following solutions:
tree.xpath("//tr/td/text()") : it does not grab the text inside font
tree.xpath("//*/text()") : it returns too many results

How can I get an XPath combining "normal" text and "font" text?

I have not found the solution yet; many thanks for your help (again).

-- Alexandre Delanoë
What about

//tr/td/descendant-or-self::*[not(self::span[@class="DocHeader"])]/text()

? (with or without the [@class="DocHeader"] predicate)

On Thu, Apr 24, 2014 at 7:00 AM, Alexandre Delanoë <debian@delanoe.org> wrote:
> How can I get an XPath combining "normal" text and "font" text?
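
For readers who want to try the suggested expression, here is a minimal sketch. It assumes a small sample document modelled on the one in the question (with the <p> tag closed properly), and the names HTML, EXPR and collect_text are made up for this illustration. Whitespace is normalized in Python because how the parser keeps whitespace-only text nodes can vary.

```python
from lxml import etree

# Sample input modelled on the question; an assumption, not the poster's real file.
HTML = (
    '<html><body><table><tr><td>'
    '<span class="DocHeader">NOT WANTED DATA</span>\n'
    'TEXT1\n'
    '<p></p>\n'
    'TEXT2\nTEXT3\n'
    '<font color="red">RED</font>\n'
    'TEXT4.\n'
    '</td></tr></table></body></html>'
)

root = etree.fromstring(HTML, etree.HTMLParser())

# The expression suggested in the reply: take the text of the td and of every
# descendant element, except the span carrying class="DocHeader".
EXPR = '//tr/td/descendant-or-self::*[not(self::span[@class="DocHeader"])]/text()'


def collect_text(node, expr):
    """Evaluate expr on node; return the non-blank text pieces, whitespace-normalized."""
    pieces = (" ".join(t.split()) for t in node.xpath(expr))
    return [p for p in pieces if p]


print(" ".join(collect_text(root, EXPR)))
# prints: TEXT1 TEXT2 TEXT3 RED TEXT4.
```

The text nodes come back in document order, which is why RED lands between TEXT3 and TEXT4.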
On Thursday, 24 April 2014, around 09:16, Paul Tremberth wrote:
> What about
> //tr/td/descendant-or-self::*[not(self::span[@class="DocHeader"])]/text()
> ? (with or without the [@class="DocHeader"] predicate)

Nice guess, this trick does exactly what I needed:

.xpath("./tr/td/descendant-or-self::*/text()")

Many thanks. I did not know "descendant-or-self"; where are these options documented?

Many thanks again,
-- Alexandre Delanoë
On Thursday, 24 April 2014, around 09:25, Alexandre Delanoë wrote:
> On Thursday, 24 April 2014, around 09:16, Paul Tremberth wrote:
> > What about
> > //tr/td/descendant-or-self::*[not(self::span[@class="DocHeader"])]/text()
> > ? (with or without the [@class="DocHeader"] predicate)

I will need the predicate indeed.

> Nice guess, this trick does exactly what I needed:
> .xpath("./tr/td/descendant-or-self::*/text()")
> Many thanks. I did not know "descendant-or-self"; where are these options documented?

Okay, sorry: http://www.w3schools.com/xpath/xpath_axes.asp

Have a very nice day.
-- Alexandre Delanoë
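
To see why the predicate matters, the sketch below compares the relative expression with and without it, run per table as in the original loop. The sample document is hypothetical: a <b> element is nested inside the header span on purpose, to show that not(self::span[...]) filters only the span element itself, so text of elements nested inside it still comes through. The third expression, using ancestor-or-self, is offered as one possible way to drop the whole header subtree; it is not something proposed in the thread. The helper name joined is also made up here.

```python
from lxml import etree

# Hypothetical sample: the header span contains a nested <b> element.
HTML = (
    '<html><body><table><tr><td>'
    '<span class="DocHeader">NOT <b>WANTED</b> DATA</span>'
    ' TEXT1 <font color="red">RED</font> TEXT4.'
    '</td></tr></table></body></html>'
)
root = etree.fromstring(HTML, etree.HTMLParser())

BASE = './tr/td/descendant-or-self::*'


def joined(node, expr):
    """Evaluate expr on node and return all text as one whitespace-normalized string."""
    return " ".join(" ".join(node.xpath(expr)).split())


results = []
for doc in root.xpath('/html/body/table'):
    # No predicate: the header text is included.
    everything = joined(doc, BASE + '/text()')
    # Predicate from the thread: skips the span element, but not elements nested in it.
    no_span = joined(doc, BASE + '[not(self::span[@class="DocHeader"])]/text()')
    # Variant (an assumption of this sketch): skip the span and everything below it.
    no_subtree = joined(doc, BASE + '[not(ancestor-or-self::span[@class="DocHeader"])]/text()')
    results.append((everything, no_span, no_subtree))

for row in results:
    print(row)
```

If the header span never contains child elements, as in the document shown in the question, the second and third expressions give the same result.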
I prefer to refer to http://www.w3.org/TR/xpath/ (XPath 1.0):

// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document (even a para element that is a document element will be selected by //para since the document element node is a child of the root node); div//para is short for div/descendant-or-self::node()/child::para and so will select all para descendants of div children.

Other axis names:

[6] AxisName ::= 'ancestor' | 'ancestor-or-self' | 'attribute' | 'child' | 'descendant' | 'descendant-or-self' | 'following' | 'following-sibling' | 'namespace' | 'parent' | 'preceding' | 'preceding-sibling' | 'self'

On Thu, Apr 24, 2014 at 9:33 AM, Alexandre Delanoë <debian@delanoe.org> wrote:
> > Many thanks. I did not know "descendant-or-self"; where are these options documented?
>
> Okay, sorry:
> http://www.w3schools.com/xpath/xpath_axes.asp
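
The abbreviation described in the spec excerpt can be checked directly with lxml. This is only a small sketch on a made-up XML document; the element names follow the spec's para/div examples, and the variable names and the texts helper are invented for the illustration.

```python
from lxml import etree

# Made-up document with para elements at three different depths.
XML = (
    '<doc>'
    '<para>one</para>'
    '<div><para>two</para><section><para>three</para></section></div>'
    '</doc>'
)
root = etree.fromstring(XML)


def texts(elements):
    """Return the text of each element, to make results easy to compare."""
    return [el.text for el in elements]


# //para versus its unabbreviated form.
short_form = root.xpath('//para')
long_form = root.xpath('/descendant-or-self::node()/child::para')

# div//para versus its unabbreviated form, evaluated relative to <doc>.
div_short = root.xpath('div//para')
div_long = root.xpath('div/descendant-or-self::node()/child::para')

print(texts(short_form), texts(long_form))
print(texts(div_short), texts(div_long))
```

Each pair should select the same elements: all three para elements for the first pair, and only the two inside div for the second.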
participants (2)
- Alexandre Delanoë
- Paul Tremberth