
Friday, November 27, 2015, 16:26:40 Holger:
Hi,
http://www.w3.org/TR/xpath/#section-Location-Steps says there are actually 3 steps of searching: axes, node-tests and predicates
1) as an axis (=outside square brackets and before path), "descendants::p" selects descendants of the currently considered set (initially, this is the root node), then subjects them to the node test ("p"=all <p> tags), so it would select all "p" nodes 2) as a predicate, however, it *selects the descendants of the currently considered node and checks if there's any that satisfies the node test*!
Maybe this is just expressed a bit mistakably, but:
IMHO descendant::p does the same wherever applied. It selects the node set consisting of all nodes from the context node's descendant axis that fulfill the node test p.
Of course, the basic operation is the same in both cases. But, due to different contexts, it has opposite effects! Which is far from being obvious. That's what I wanted to stress.
Node test stage (http://www.w3.org/TR/xpath/#NT-NodeTest) has very limited functionality. In particular, it cannot use patterns except "*" (only single names), so we can't select both tags at the same time with it.
So, the expression you're looking for is either
(descendant::p|descendant::l)//*[@n=1]
(selects tags at axis stage) which would select all "p" tags, then all "l" tags, then iterate over all elements in their subtrees ("//*") and check every one for the condition, or
(//p|//l)//*[@n=1]
(1)
timeit.Timer("t.xpath('(//p|//l)//*[@n=1]')", setup='from __main__ import t').repeat() [42.82267999649048, 43.092921018600464, 43.01901602745056]
which is equivalent for our case: searching the spec for "//" finds http://www.w3.org/TR/xpath/#path-abbrev which says "// is short for /descendant-or-self::node()/", or
//*[(ancestor::p or ancestor::l) and @n=1]
(2)
timeit.Timer("t.xpath('//*[(ancestor::p or ancestor::l) and @n=1]')", setup='from __main__ import t').repeat() [52.73698687553406, 53.14989614486694, 52.96542406082153]
(selects tags at predicate stage) which would iterate over _all_ nodes ("//*" - subtree of root), and for every one check if it has "p" or "l" as an ancestor (which is much more work) etc, or
//*[name()='p' or name()='l']//*[@n=1]
(3)
timeit.Timer("""t.xpath('//*[name()="p" or name()="l"]//*[@n=1]')""", setup='from __main__ import t').repeat() [56.8190279006958, 56.754719972610474, 56.88860607147217]
to find both kinds of tags in one go and only then explore their subtrees - this looks like the most efficient way.
It's not, at least not for lxml/libxml2@2.9.1 according to timeit ;-)
This will of course depend on the XPath implementation but my guess is that (3) is actually rather inefficient because the filter predicate [name()='p' or name()='l'] needs to get applied to all elements whereas for (1) no predicate evaluation is necessary for getting at the p and l elements - those are selected by a (union of) node set selection.
Good catch. Apparently, node test is so much simpler than predicate test it outerperforms an additional tree walk. Will they tie at some point because of this? In [24]: timeit t.xpath('.') 10000 loops, best of 3: 122 us per loop In [29]: timeit t.xpath('//x') 10000 loops, best of 3: 130 us per loop In [25]: timeit t.xpath('//p') 10000 loops, best of 3: 131 us per loop In [31]: timeit t.xpath('//*') 10000 loops, best of 3: 144 us per loop In [30]: timeit t.xpath('//*[name()="x"]') 1000 loops, best of 3: 233 us per loop In [32]: timeit t.xpath('//*[name()="p"]') 1000 loops, best of 3: 235 us per loop So, 122 = setup + 1 result 8 = tree walk + node test | => ~1/item = result composition 9 = tree walk + node test + 1 result | | => 4 = node test/10 nodes 14 = tree walk + 10 results | | | => 4 = tree walk/10 nodes 103 - tree walk + predicate test => ~100 = predicate test/10 nodes (I'm neglecting text nodes: nodes="effective nodes") Assuming tree walk, node test and predicate test times are proportional to the number of nodes, they'll tie at 121+(0.4N+0.4N)*2 = 121 + (0.4N + 100N) never.
I haven't looked at libxml2 sources, though.
"|" can be used in the 3rd case as well, but in a predicate, it can behave unpredictably:
In [125]: t.xpath("//*[(ancestor::p|@n=1)]") Out[125]: [<Element a at 0x163a8c8>, <Element b at 0x163a120>, <Element c at 0x163adf0>]
Wow, I didn't even know this is possible, same result with Xalan 2.7.1. So why does this not return /root/p/a[2]?
Does this effectively translate to
//*[(ancestor::p and @n=1) or @n=1]
?
In [126]: t.xpath("//*[(ancestor::p or @n=1)]") Out[126]: [<Element a at 0x163a8c8>, <Element a at 0x163afd0>, <Element b at 0x163a120>>, <Element c at 0x163adf0>]
In [131]: t.xpath("//*[(@n=1|ancestor::p)]") XPathEvalError: Invalid type
Doesn't produce an error in Xalan 2.7.1, returns an empty result set.
..so you should better use "|" in a FilterExpr and "or" in a Predicate.
I somehow agree :-) but then again you could use it for things like
//xs:element[@type=(//xs:simpleType/@name|//xs:complexType/@name)]
which is analoguous to
//xs:element[@type=//xs:simpleType/@name or @type=//xs:complexType/@name]
So maybe "|" should be avoided as a "top-level" operator in predicates, for sanity reasons.
Holger
Landesbank Baden-Wuerttemberg Anstalt des oeffentlichen Rechts Hauptsitze: Stuttgart, Karlsruhe, Mannheim, Mainz HRA 12704 Amtsgericht Stuttgart
_________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml
-- Regards, Ivan Pozdeev