lxml support of group condition for xpath ?
Hi there,

I am truly sorry if this question was asked before. I have searched, but either it was not asked or I am not using the right words...

I am trying to deal with a group node condition in XPath, i.e. an XPath like

    //div/(descendant::p|descendant::l)[@n='1']
    //div/(ns:p|ns:l)[@n='1']

While this seems to be supported in XPath, I cannot find a way to make it work. My hands are a little tied here because I do not have much freedom over the XPath itself: I need to be able to support an XPath like that, or rather, I can't go with another, simpler route like

    //div/p[@n='1'] or //div/l[@n='1']
    //ns:div/ns:p[@n='1'] or //ns:div/ns:l[@n='1']

The weird thing is that the following XPath works:

    (//tei:l|//tei:p)[@n='1']

But the moment I add something in front of it, it breaks.

    [node for node in xml.xpath("(//tei:l|//tei:p)[@n='1']", namespaces={"tei": "http://www.tei-c.org/ns/1.0"})]

Thank you for your time and help.

Best,
Thibault
Hi Thibault,

On Wed, 25 Nov 2015 14:22:20 +0100 Thibault Clerice <leponteineptique@gmail.com> wrote:
[...] I am trying to deal with group node condition in xpath, /ie/ xpath like
    //div/(descendant::p|descendant::l)[@n='1']
    //div/(ns:p|ns:l)[@n='1']
I'm not 100% sure, but I guess this XPath expression is not supported in XPath 1.0.
While this seems to be supported in xpath,
... only in XPath 2.0 if I'm not mistaken.
I cannot find a way to make it work. My hands are a little tied here because I have not that much freedom about the xpath in itself, I need to be able to support a xpath like that, or rather, I can't go with another simpler route like
You can always "resolve" the group, which should work:

    //div/descendant::p[@n='1'] | //div/descendant::l[@n='1']

I know, it's a little bit verbose. But maybe you can optimize it with some lxml and Python magic.
    //div/p[@n='1'] or //div/l[@n='1']
    //ns:div/ns:p[@n='1'] or //ns:div/ns:l[@n='1']
The weird thing is that the following xpath would work
(//tei:l|//tei:p)[@n='1']
This *is* supported in XPath 1.0.
[...]
Hope this helps a bit. -- Gruß/Regards, Thomas Schraitle
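For readers wondering what the "Python magic" could look like, here is a minimal sketch that expands the group into an XPath 1.0 union before handing it to lxml. The helper name `expand_group` and the sample document are made up for illustration; neither comes from lxml or from the thread.

```python
from lxml import etree


def expand_group(prefix, alternatives, predicate=""):
    """Hypothetical helper: build 'prefix/alt1[pred] | prefix/alt2[pred] | ...'."""
    return " | ".join(f"{prefix}/{alt}{predicate}" for alt in alternatives)


# Made-up sample document, only for illustration.
doc = etree.fromstring(
    "<root><div><p n='1'/><l n='1'/><p n='2'/></div><p n='1'/></root>"
)

# A parenthesised group used as a step is not XPath 1.0 syntax,
# so lxml (libxml2) is expected to reject it.
try:
    doc.xpath("//div/(descendant::p|descendant::l)[@n='1']")
    group_step_accepted = True
except etree.XPathError:
    group_step_accepted = False

# The expanded union is plain XPath 1.0.
expr = expand_group("//div", ["descendant::p", "descendant::l"], "[@n='1']")
matches = doc.xpath(expr)

print(group_step_accepted)
print(expr)
print([el.tag for el in matches])
```

This only covers the simple case of one group at the end of a path; an expression with several groups would need a product of all alternatives.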
Hi Thomas,

Thanks for the explanation. I tried to find the documentation about XPath 1 and XPath 2 to make sure this was not the cause, but could not find the needle in the haystack.

Unfortunately, I cannot use the syntax you gave, but I found two alternatives, one of them given to me in a private mail by someone on the list (Jamie):

- //div//*[local-name()='p' or local-name()='l'][@n='1']
- //div//*[self::ns:p or self::ns:l][@n='1']

Both work terribly well, in case anyone is like me, stuck without much ability to go the /a/a | /a/b road.

Thanks a lot,
Thibault
Hi Thibault,

On Thu, 26 Nov 2015 09:16:13 +0100 Thibault Clerice <leponteineptique@gmail.com> wrote:
Thanks for the explanation, I tried to find the documentation about XPath 1 and Xpath 2 to ensure this was not due to that, but could not find the needle in the haystack.
Me neither.
Unfortunately, I cannot use the syntax you gave, but I found two alternatives, one being given to me by someone on the list in a private mail (Jamie):
- //div//*[local-name()='p' or local-name()='l'][@n='1']
- //div//*[self::ns:p or self::ns:l][@n='1']

Just to make it clear: the first expression matches *all* 'p' and 'l' elements regardless of which namespace they belong to. This may or may not be what you want. It probably works in most cases, but be aware that it will be an issue if you have a mixture of elements from different namespaces.
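To make the namespace caveat concrete, here is a small sketch comparing the two expressions on a made-up document that mixes a TEI `p` with a `p` from another namespace. The `urn:example:other` namespace and the sample content are assumptions for illustration only.

```python
from lxml import etree

TEI = "http://www.tei-c.org/ns/1.0"

# Made-up document: one TEI <p>, one <p> from an unrelated namespace, two TEI <l>.
doc = etree.fromstring(
    f'<div xmlns="{TEI}" xmlns:o="urn:example:other">'
    '<p n="1"/><o:p n="1"/><l n="1"/><l n="2"/></div>'
)
ns = {"tei": TEI}

# local-name() ignores the namespace, so the foreign <o:p> matches as well.
by_local_name = doc.xpath(
    "//tei:div//*[local-name()='p' or local-name()='l'][@n='1']", namespaces=ns
)

# The self:: axis with a prefixed node test only matches TEI elements.
by_self_axis = doc.xpath(
    "//tei:div//*[self::tei:p or self::tei:l][@n='1']", namespaces=ns
)

print(len(by_local_name), len(by_self_axis))
```

If the documents are known to contain a single namespace, the two forms should select the same elements.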
Work both terribly well. In case anyone is like me, stuck with not so much ability to go the /a/a | /a/b road.
Another solution would be to avoid XPath and use lxml/Python. You can iterate through your tree, maybe something like this:
    from lxml import etree

    NS = "{YOUR_NAMESPACE}"
    tree = etree.parse("some.xml")
    root = tree.getroot()
    # all <div> elements in the namespace
    alldivs = [i for i in root.iterdescendants("{}div".format(NS))]
    # all <p> and <l> descendants of those divs
    all_l_p = [i for d in alldivs for i in d.iterdescendants()
               if i.tag in ("{}p".format(NS), "{}l".format(NS))]
Of course you need to filter the additional 'n' attribute too, but I hope the idea is clear. -- Gruß/Regards, Thomas Schraitle
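As a sketch of the same idea with the 'n' attribute filter added, assuming the TEI namespace from the original question and a made-up inline document instead of some.xml:

```python
from lxml import etree

# Clark-notation prefix; the TEI namespace is taken from the original question.
NS = "{http://www.tei-c.org/ns/1.0}"

# Made-up sample document, only for illustration.
root = etree.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<div><p n="1"/><l n="1"/><l n="2"/></div><p n="1"/></TEI>'
)

wanted = {NS + "p", NS + "l"}

# Walk every <div>, then every descendant of it, keeping <p>/<l> with n="1".
hits = [
    el
    for div in root.iter(NS + "div")
    for el in div.iterdescendants()
    if el.tag in wanted and el.get("n") == "1"
]

print([etree.QName(el).localname for el in hits])
```

Note that with nested div elements this loop would report the same element more than once, which the XPath versions do not; deduplicating is left out to keep the sketch short.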
Thursday, November 26, 2015, 11:47:00 Thomas:
Hi Thibault,
On Thu, 26 Nov 2015 09:16:13 +0100 Thibault Clerice <leponteineptique@gmail.com> wrote:
Thanks for the explanation, I tried to find the documentation about XPath 1 and Xpath 2 to ensure this was not due to that, but could not find the needle in the haystack.
Me neither.
Needle in a haystack, hah! Use a magnet!

1. Googling for "xpath specification" gives: http://www.w3.org/TR/xpath/, http://www.w3.org/TR/xpath20/
The specs include the complete formal grammar. Each grammar section is preceded by an explanation of the corresponding semantics. In the W3 docs, they even make each mention of an entity in a grammar expression a link to its definition for convenient browsing! (The scientific term for the "entities" is "nonterminal symbols", so the anchors start with "NT".)

2. Searching the 1.0 spec for "|" shows it's used in the formal grammar syntax, and the literals in it are decorated like "'<character>'".

3. Searching for "'|'" leads to http://www.w3.org/TR/xpath/#NT-UnionExpr, which says it's a "union operator" that "computes the union of its operands, which must be node-sets".

4. http://www.w3.org/TR/xpath/#NT-PathExpr just below shows you can also _precede_ a path with a "filter expression".

5. Looking at the Predicate definition (= square-bracketed expression as per http://www.w3.org/TR/xpath/#NT-Predicate) and further along the links leads to http://www.w3.org/TR/xpath/#NT-OrExpr, which indeed is the proper one for a logical OR in a predicate (predicate = a test performed on every item in a set). UnionExpr can be part of it, too, through UnaryExpr.

Tests show the distinction is: "|" returns a set while "or" returns a boolean:

    In [100]: t.xpath("(descendant::l|descendant::p)")
    Out[100]: [<Element p at 0x16890f8>, <Element l at 0x1689058>]

    In [101]: t.xpath("(descendant::l or descendant::p)")
    Out[101]: True

6. "(descendant::p|descendant::l)//*[@n=1]" on the attached file gives proper results. But, upon moving it inside the predicate, nothing is returned. Strange. "(descendant::p|descendant::l)" gives a strange result: only the "p" and "l" tags. What's up with that? Hmm... why not try it the other way round and see what happens?

    In [54]: t.xpath("//*[(ancestor::p)]")
    Out[54]: [<Element a at 0x1548440>, <Element a at 0x16808a0>]

Eureka!
"descendant::p" does not mean "descendants of p"!!

http://www.w3.org/TR/xpath/#section-Location-Steps says there are actually 3 steps of searching: axes, node-tests and predicates

1) as an axis (= outside square brackets and before path), "descendant::p" selects descendants of the currently considered set (initially, this is the root node), then subjects them to the node test ("p" = all <p> tags), so it would select all "p" nodes
2) as a predicate, however, it *selects the descendants of the currently considered node and checks if there's any that satisfies the node test*!

Node test stage (http://www.w3.org/TR/xpath/#NT-NodeTest) has very limited functionality. In particular, it cannot use patterns except "*" (only single names), so we can't select both tags at the same time with it.

So, the expression you're looking for is either

    (descendant::p|descendant::l)//*[@n=1]

(selects tags at axis stage) which would select all "p" tags, then all "l" tags, then iterate over all elements in their subtrees ("//*") and check every one for the condition, or

    (//p|//l)//*[@n=1]

which is equivalent for our case: searching the spec for "//" finds http://www.w3.org/TR/xpath/#path-abbrev which says "// is short for /descendant-or-self::node()/", or

    //*[(ancestor::p or ancestor::l) and @n=1]

(selects tags at predicate stage) which would iterate over _all_ nodes ("//*" - subtree of root), and for every one check if it has "p" or "l" as an ancestor (which is much more work) etc, or

    //*[name()='p' or name()='l']//*[@n=1]

to find both kinds of tags in one go and only then explore their subtrees - this looks like the most efficient way.
"|" can be used in the 3rd case as well, but in a predicate, it can behave unpredictably:

    In [125]: t.xpath("//*[(ancestor::p|@n=1)]")
    Out[125]: [<Element a at 0x163a8c8>, <Element b at 0x163a120>, <Element c at 0x163adf0>]

    In [126]: t.xpath("//*[(ancestor::p or @n=1)]")
    Out[126]: [<Element a at 0x163a8c8>, <Element a at 0x163afd0>, <Element b at 0x163a120>, <Element c at 0x163adf0>]

    In [131]: t.xpath("//*[(@n=1|ancestor::p)]")
    XPathEvalError: Invalid type

..so you had better use "|" in a FilterExpr and "or" in a Predicate. The precise rules for handling sets in predicates are outlined in http://www.w3.org/TR/xpath/#booleans and are rather convoluted, so take Python Zen to heart and don't mix sets with booleans if you can do without.

7. Searching for "'|'" in the 2.0 spec gives nothing, neither does searching for "union" yield any references to a "union operation", so '|' is indeed specific to 1.0.

P.S. I didn't know any of this before composing the letter.
Unfortunately, I cannot use the syntax you gave, but I found two alternatives, one being given to me by someone on the list in a private mail (Jamie):
- //div//*[local-name()='p' or local-name()='l'][@n='1']
- //div//*[self::ns:p or self::ns:l][@n='1']

Just to make it clear: the first expression matches *all* 'p' and 'l' elements regardless of which namespace they belong to. This may or may not be what you want.
Probably this works in most cases, but you should be aware that if you have a mixture of elements from different namespaces, this will be an issue.
Work both terribly well. In case anyone is like me, stuck with not so much ability to go the /a/a | /a/b road.
Another solution would be to avoid XPath and use lxml/Python. You can iterate through your tree, maybe something like this:
    from lxml import etree

    NS = "{YOUR_NAMESPACE}"
    tree = etree.parse("some.xml")
    root = tree.getroot()
    # all <div> elements in the namespace
    alldivs = [i for i in root.iterdescendants("{}div".format(NS))]
    # all <p> and <l> descendants of those divs
    all_l_p = [i for d in alldivs for i in d.iterdescendants()
               if i.tag in ("{}p".format(NS), "{}l".format(NS))]
Of course you need to filter the additional 'n' attribute too, but I hope the idea is clear.
I expect this to be an order of magnitude or so slower. (timeit shows 6ms for my "best" version vs 25ms for this) -- Regards, Ivan Pozdeev
Hi,
http://www.w3.org/TR/xpath/#section-Location-Steps says there are actually 3 steps of searching: axes, node-tests and predicates
1) as an axis (= outside square brackets and before path), "descendant::p" selects descendants of the currently considered set (initially, this is the root node), then subjects them to the node test ("p" = all <p> tags), so it would select all "p" nodes
2) as a predicate, however, it *selects the descendants of the currently considered node and checks if there's any that satisfies the node test*!
Maybe this is just expressed a bit misleadingly, but: IMHO descendant::p does the same wherever applied. It selects the node set consisting of all nodes from the context node's descendant axis that fulfill the node test p.
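A small sketch of that point, on a made-up three-element tree: the step itself always selects `p` descendants of the context node, and what differs is whether that node set is returned (as a path) or merely tested for non-emptiness (inside a predicate).

```python
from lxml import etree

# Made-up tree, only for illustration.
t = etree.fromstring("<root><p><a/></p><l/></root>")

# As a path step: returns the <p> descendants of the context node (root).
as_step = [el.tag for el in t.xpath("descendant::p")]

# Inside a predicate: evaluated once per candidate element; an element is
# kept when its own descendant::p node set is non-empty.
in_predicate = [el.tag for el in t.xpath("//*[descendant::p]")]

# To keep elements located *below* a <p>, the predicate needs the reverse axis.
below_p = [el.tag for el in t.xpath("//*[ancestor::p]")]

print(as_step, in_predicate, below_p)
```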
Node test stage (http://www.w3.org/TR/xpath/#NT-NodeTest) has very limited functionality. In particular, it cannot use patterns except "*" (only single names), so we can't select both tags at the same time with it.
So, the expression you're looking for is either
(descendant::p|descendant::l)//*[@n=1]
(selects tags at axis stage) which would select all "p" tags, then all "l" tags, then iterate over all elements in their subtrees ("//*") and check every one for the condition, or
(//p|//l)//*[@n=1]
(1)
    timeit.Timer("t.xpath('(//p|//l)//*[@n=1]')", setup='from __main__ import t').repeat()
    [42.82267999649048, 43.092921018600464, 43.01901602745056]
which is equivalent for our case: searching the spec for "//" finds http://www.w3.org/TR/xpath/#path-abbrev which says "// is short for /descendant-or-self::node()/", or
//*[(ancestor::p or ancestor::l) and @n=1]
(2)
    timeit.Timer("t.xpath('//*[(ancestor::p or ancestor::l) and @n=1]')", setup='from __main__ import t').repeat()
    [52.73698687553406, 53.14989614486694, 52.96542406082153]
(selects tags at predicate stage) which would iterate over _all_ nodes ("//*" - subtree of root), and for every one check if it has "p" or "l" as an ancestor (which is much more work) etc, or
//*[name()='p' or name()='l']//*[@n=1]
(3)
    timeit.Timer("""t.xpath('//*[name()="p" or name()="l"]//*[@n=1]')""", setup='from __main__ import t').repeat()
    [56.8190279006958, 56.754719972610474, 56.88860607147217]
to find both kinds of tags in one go and only then explore their subtrees - this looks like the most efficient way.
It's not, at least not for lxml/libxml2@2.9.1 according to timeit ;-) This will of course depend on the XPath implementation but my guess is that (3) is actually rather inefficient because the filter predicate [name()='p' or name()='l'] needs to get applied to all elements whereas for (1) no predicate evaluation is necessary for getting at the p and l elements - those are selected by a (union of) node set selection. I haven't looked at libxml2 sources, though.
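For anyone who wants to repeat the comparison, here is a self-contained sketch. The synthetic tree below is an assumption (the file attached to the thread is not available), so absolute numbers and possibly even the ranking may differ from the timings quoted above.

```python
import timeit

from lxml import etree

# Synthetic tree, NOT the file attached to the thread.
root = etree.Element("root")
for i in range(200):
    parent = etree.SubElement(root, "p" if i % 2 else "l")
    etree.SubElement(parent, "a", n=str(i % 3))
    etree.SubElement(root, "x", n="1")  # noise outside any <p>/<l>

EXPRESSIONS = {
    "(1) union of node tests": "(//p|//l)//*[@n=1]",
    "(2) ancestor predicate": "//*[(ancestor::p or ancestor::l) and @n=1]",
    "(3) name() predicate": "//*[name()='p' or name()='l']//*[@n=1]",
}

# Evaluate once to confirm the three expressions agree on this tree.
results = {label: root.xpath(expr) for label, expr in EXPRESSIONS.items()}

for label, expr in EXPRESSIONS.items():
    # Best of 3 repeats, 200 evaluations each.
    best = min(timeit.repeat(lambda: root.xpath(expr), number=200, repeat=3))
    print(f"{label}: {best:.4f} s per 200 runs, {len(results[label])} matches")
```

Taking the minimum of the repeats follows the usual timeit advice, since higher values mostly reflect interference from other processes.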
"|" can be used in the 3rd case as well, but in a predicate, it can behave unpredictably:
    In [125]: t.xpath("//*[(ancestor::p|@n=1)]")
    Out[125]: [<Element a at 0x163a8c8>, <Element b at 0x163a120>, <Element c at 0x163adf0>]
Wow, I didn't even know this is possible, same result with Xalan 2.7.1. So why does this not return /root/p/a[2]? Does this effectively translate to //*[(ancestor::p and @n=1) or @n=1] ?
    In [126]: t.xpath("//*[(ancestor::p or @n=1)]")
    Out[126]: [<Element a at 0x163a8c8>, <Element a at 0x163afd0>, <Element b at 0x163a120>, <Element c at 0x163adf0>]

    In [131]: t.xpath("//*[(@n=1|ancestor::p)]")
    XPathEvalError: Invalid type
Doesn't produce an error in Xalan 2.7.1, returns an empty result set.
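One reading that fits these results: in the XPath 1.0 grammar "|" binds tighter than "=", so `ancestor::p|@n=1` parses as `(ancestor::p|@n)=1`, and `@n=1|ancestor::p` parses as `@n=(1|ancestor::p)`, where the number 1 is not a valid union operand. The sketch below checks this on a made-up tree that is only a guess at the shape of the attached file.

```python
from lxml import etree

# Made-up tree, guessed to resemble the file attached to the thread.
t = etree.fromstring(
    '<root><p><a n="1"/><a n="2"/></p><b n="1"/><c n="1"/></root>'
)


def tags(expr):
    """Evaluate expr against t and return the tags of the selected elements."""
    return [el.tag for el in t.xpath(expr)]


implicit = tags("//*[ancestor::p|@n=1]")    # as written in the thread
explicit = tags("//*[(ancestor::p|@n)=1]")  # the same expression, parenthesised
with_or = tags("//*[ancestor::p or @n=1]")  # boolean "or" instead of union

# With the operands swapped, the union would have to include the number 1.
try:
    t.xpath("//*[@n=1|ancestor::p]")
    union_with_number_fails = False
except etree.XPathError:
    union_with_number_fails = True

print(implicit, explicit, with_or, union_with_number_fails)
```

Under this reading the second `a` is dropped because neither its `n` attribute nor the string value of its `p` ancestor converts to the number 1, so it is not quite the same as `(ancestor::p and @n=1) or @n=1`.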
..so you should better use "|" in a FilterExpr and "or" in a Predicate.
I somehow agree :-) but then again you could use it for things like

    //xs:element[@type=(//xs:simpleType/@name|//xs:complexType/@name)]

which is analogous to

    //xs:element[@type=//xs:simpleType/@name or @type=//xs:complexType/@name]

So maybe "|" should be avoided as a "top-level" operator in predicates, for sanity reasons.

Holger

Landesbank Baden-Wuerttemberg Anstalt des oeffentlichen Rechts Hauptsitze: Stuttgart, Karlsruhe, Mannheim, Mainz HRA 12704 Amtsgericht Stuttgart
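A quick sketch of that equivalence on a made-up schema fragment (the element and type names are invented for illustration): comparing an attribute against a union of node sets is true when any member matches, which is what the spelled-out "or" form expresses too.

```python
from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"

# Made-up schema fragment, only for illustration.
schema = etree.fromstring(
    f'<xs:schema xmlns:xs="{XS}">'
    '<xs:simpleType name="S"/><xs:complexType name="C"/>'
    '<xs:element name="a" type="S"/>'
    '<xs:element name="b" type="C"/>'
    '<xs:element name="c" type="xs:string"/>'
    '</xs:schema>'
)
ns = {"xs": XS}

# "|" used inside a parenthesised operand of "=", not at the predicate top level.
with_union = schema.xpath(
    "//xs:element[@type=(//xs:simpleType/@name|//xs:complexType/@name)]",
    namespaces=ns,
)

# The same test spelled out with "or".
with_or = schema.xpath(
    "//xs:element[@type=//xs:simpleType/@name or @type=//xs:complexType/@name]",
    namespaces=ns,
)

print([el.get("name") for el in with_union])
print([el.get("name") for el in with_or])
```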
Friday, November 27, 2015, 16:26:40 Holger:
Hi,
http://www.w3.org/TR/xpath/#section-Location-Steps says there are actually 3 steps of searching: axes, node-tests and predicates
1) as an axis (= outside square brackets and before path), "descendant::p" selects descendants of the currently considered set (initially, this is the root node), then subjects them to the node test ("p" = all <p> tags), so it would select all "p" nodes
2) as a predicate, however, it *selects the descendants of the currently considered node and checks if there's any that satisfies the node test*!
Maybe this is just expressed a bit mistakably, but:
IMHO descendant::p does the same wherever applied. It selects the node set consisting of all nodes from the context node's descendant axis that fulfill the node test p.
Of course, the basic operation is the same in both cases. But, due to different contexts, it has opposite effects! Which is far from being obvious. That's what I wanted to stress.
Node test stage (http://www.w3.org/TR/xpath/#NT-NodeTest) has very limited functionality. In particular, it cannot use patterns except "*" (only single names), so we can't select both tags at the same time with it.
So, the expression you're looking for is either
(descendant::p|descendant::l)//*[@n=1]
(selects tags at axis stage) which would select all "p" tags, then all "l" tags, then iterate over all elements in their subtrees ("//*") and check every one for the condition, or
(//p|//l)//*[@n=1]
(1)
    timeit.Timer("t.xpath('(//p|//l)//*[@n=1]')", setup='from __main__ import t').repeat()
    [42.82267999649048, 43.092921018600464, 43.01901602745056]
which is equivalent for our case: searching the spec for "//" finds http://www.w3.org/TR/xpath/#path-abbrev which says "// is short for /descendant-or-self::node()/", or
//*[(ancestor::p or ancestor::l) and @n=1]
(2)
    timeit.Timer("t.xpath('//*[(ancestor::p or ancestor::l) and @n=1]')", setup='from __main__ import t').repeat()
    [52.73698687553406, 53.14989614486694, 52.96542406082153]
(selects tags at predicate stage) which would iterate over _all_ nodes ("//*" - subtree of root), and for every one check if it has "p" or "l" as an ancestor (which is much more work) etc, or
//*[name()='p' or name()='l']//*[@n=1]
(3)
    timeit.Timer("""t.xpath('//*[name()="p" or name()="l"]//*[@n=1]')""", setup='from __main__ import t').repeat()
    [56.8190279006958, 56.754719972610474, 56.88860607147217]
to find both kinds of tags in one go and only then explore their subtrees - this looks like the most efficient way.
It's not, at least not for lxml/libxml2@2.9.1 according to timeit ;-)
This will of course depend on the XPath implementation but my guess is that (3) is actually rather inefficient because the filter predicate [name()='p' or name()='l'] needs to get applied to all elements whereas for (1) no predicate evaluation is necessary for getting at the p and l elements - those are selected by a (union of) node set selection.
Good catch. Apparently, node test is so much simpler than predicate test that it outperforms an additional tree walk. Will they tie at some point because of this?

    In [24]: timeit t.xpath('.')
    10000 loops, best of 3: 122 us per loop

    In [29]: timeit t.xpath('//x')
    10000 loops, best of 3: 130 us per loop

    In [25]: timeit t.xpath('//p')
    10000 loops, best of 3: 131 us per loop

    In [31]: timeit t.xpath('//*')
    10000 loops, best of 3: 144 us per loop

    In [30]: timeit t.xpath('//*[name()="x"]')
    1000 loops, best of 3: 233 us per loop

    In [32]: timeit t.xpath('//*[name()="p"]')
    1000 loops, best of 3: 235 us per loop

So,

    122 = setup + 1 result
      8 = tree walk + node test
      9 = tree walk + node test + 1 result
     14 = tree walk + 10 results
    103 = tree walk + predicate test

    => ~1/item = result composition
    => 4 = node test/10 nodes
    => 4 = tree walk/10 nodes
    => ~100 = predicate test/10 nodes

(I'm neglecting text nodes: nodes="effective nodes")

Assuming tree walk, node test and predicate test times are proportional to the number of nodes, they'll tie at

    121 + (0.4N + 0.4N)*2 = 121 + (0.4N + 100N)

i.e. never.
-- Regards, Ivan Pozdeev
participants (4)
- Holger Joukl
- Ivan Pozdeev
- Thibault Clerice
- Thomas Schraitle