
Thursday, November 26, 2015, 11:47:00 Thomas:
Hi Thibault,
On Thu, 26 Nov 2015 09:16:13 +0100 Thibault Clerice <leponteineptique@gmail.com> wrote:
Thanks for the explanation. I tried to find the documentation for XPath 1 and XPath 2 to make sure the difference between them was not the cause, but could not find the needle in the haystack.
Me neither.
Needle in a haystack, hah! Use a magnet!

1. Googling for "xpath specification" gives:
   http://www.w3.org/TR/xpath/
   http://www.w3.org/TR/xpath20/
   The specs include the complete formal grammar. Each grammar section is preceded by an explanation of the corresponding semantics. In the W3 docs, every mention of an entity in a grammar expression is even a link to its definition, for convenient browsing! (The scientific term for the "entities" is "nonterminal symbols", so the anchors start with "NT".)

2. Searching the 1.0 spec for "|" shows it's used in the formal grammar syntax, and the literals in it are decorated like "'<character>'".

3. Searching for "'|'" leads to http://www.w3.org/TR/xpath/#NT-UnionExpr, which says it's a "union operator" that "computes the union of its operands, which must be node-sets".

4. http://www.w3.org/TR/xpath/#NT-PathExpr just below shows you can also _precede_ a path with a "filter expression".

5. Looking at the Predicate definition (= square-bracketed expression, as per http://www.w3.org/TR/xpath/#NT-Predicate) and following the links further leads to http://www.w3.org/TR/xpath/#NT-OrExpr, which indeed is the proper one for a logical OR in a predicate (predicate = a test performed on every item in a set). UnionExpr can be part of it, too, through UnaryExpr. Tests show the distinction is: "|" returns a set while "or" returns a boolean:

   In [100]: t.xpath("(descendant::l|descendant::p)")
   Out[100]: [<Element p at 0x16890f8>, <Element l at 0x1689058>]

   In [101]: t.xpath("(descendant::l or descendant::p)")
   Out[101]: True

6. "(descendant::p|descendant::l)//*[@n=1]" on the attached file gives proper results. But, upon moving it inside the predicate, nothing is returned. Strange. "(descendant::p|descendant::l)" gives a strange result: only the "p" and "l" tags. What's up with that? Hmm... why not try it the other way round and see what happens?

   In [54]: t.xpath("//*[(ancestor::p)]")
   Out[54]: [<Element a at 0x1548440>, <Element a at 0x16808a0>]

   Eureka!
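The experiments in points 5 and 6 can be reproduced with a short script. This is only a sketch: the attached file is not part of this message, so SAMPLE below is a made-up stand-in, and the element names "a", "b" and "c" are assumptions chosen for illustration.

```python
from lxml import etree

# Hypothetical stand-in for the attached file (not the original data).
SAMPLE = b'<root><p><a n="1"/><b n="2"/></p><l><a n="1"/></l><c n="1"/></root>'
t = etree.fromstring(SAMPLE)

# "|" is a union of node-sets, so lxml returns a list of elements.
union_result = t.xpath("(descendant::l|descendant::p)")

# "or" converts both operands to booleans, so lxml returns a Python bool.
or_result = t.xpath("(descendant::l or descendant::p)")

# A step used inside a predicate is a test on each candidate node:
# this keeps every element that has a <p> ancestor.
inside_p = t.xpath("//*[(ancestor::p)]")

print([e.tag for e in union_result])
print(or_result)
print([e.tag for e in inside_p])
```

On this stand-in, the union comes back in document order ("p" before "l") regardless of the order of the operands, which matches Out[100] above.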
"descendant::p" does not mean "descendants of p"!! http://www.w3.org/TR/xpath/#section-Location-Steps says there are actually 3 steps of searching: axes, node-tests and predicates 1) as an axis (=outside square brackets and before path), "descendants::p" selects descendants of the currently considered set (initially, this is the root node), then subjects them to the node test ("p"=all <p> tags), so it would select all "p" nodes 2) as a predicate, however, it *selects the descendants of the currently considered node and checks if there's any that satisfies the node test*! Node test stage (http://www.w3.org/TR/xpath/#NT-NodeTest) has very limited functionality. In particular, it cannot use patterns except "*" (only single names), so we can't select both tags at the same time with it. So, the expression you're looking for is either (descendant::p|descendant::l)//*[@n=1] (selects tags at axis stage) which would select all "p" tags, then all "l" tags, then iterate over all elements in their subtrees ("//*") and check every one for the condition, or (//p|//l)//*[@n=1] which is equivalent for our case: searching the spec for "//" finds http://www.w3.org/TR/xpath/#path-abbrev which says "// is short for /descendant-or-self::node()/", or //*[(ancestor::p or ancestor::l) and @n=1] (selects tags at predicate stage) which would iterate over _all_ nodes ("//*" - subtree of root), and for every one check if it has "p" or "l" as an ancestor (which is much more work) etc, or //*[name()='p' or name()='l']//*[@n=1] to find both kinds of tags in one go and only then explore their subtrees - this looks like the most efficient way. 
"|" can be used in the 3rd case as well, but in a predicate, it can behave unpredictably: In [125]: t.xpath("//*[(ancestor::p|@n=1)]") Out[125]: [<Element a at 0x163a8c8>, <Element b at 0x163a120>, <Element c at 0x163adf0>] In [126]: t.xpath("//*[(ancestor::p or @n=1)]") Out[126]: [<Element a at 0x163a8c8>, <Element a at 0x163afd0>, <Element b at 0x163a120>, <Element c at 0x163adf0>] In [131]: t.xpath("//*[(@n=1|ancestor::p)]") XPathEvalError: Invalid type ..so you should better use "|" in a FilterExpr and "or" in a Predicate. The precise rules for handling sets in predicates are outlined in http://www.w3.org/TR/xpath/#booleans and are rather convoluted, so take Python Zen to heart and don't mix sets with booleans if you can do without. 7. Searching for "'|'" in the 2.0 spec gives nothing, neither does searching for "union" yield any references to a "union operation". so '|' is indeed specific to 1.0 . P.S. I didn't know any of this before composing the letter.
Unfortunately, I cannot use the syntax you gave, but I found two alternatives, one being given to me by someone on the list in a private mail (Jamie):
- //div//*[local-name()='p' or local-name()='l'][@n='1']
- //div//*[self::ns:p or self::ns:l][@n='1']
Just to make it clear: the first expression matches *all* 'p' and 'l' elements, regardless of which namespace they belong to. This may or may not be what you want.
This probably works in most cases, but be aware that it becomes an issue if your document mixes elements with the same local name from different namespaces.
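A small sketch of that difference, under assumptions: the two namespace URIs and the document below are invented for illustration, and "div" is written as "ns:div" in both expressions because the stand-in document puts it in a namespace too.

```python
from lxml import etree

# Hypothetical document mixing two namespaces (URIs are made up).
DOC = (b'<root xmlns:x="urn:example:x" xmlns:y="urn:example:y">'
       b'<x:div><x:p n="1"/><x:l n="1"/><y:p n="1"/><x:p n="2"/></x:div>'
       b'</root>')
root = etree.fromstring(DOC)

# Prefix-to-URI mapping used only by the XPath expressions below.
NSMAP = {"ns": "urn:example:x"}

# local-name() ignores the namespace, so y:p is matched as well.
by_local_name = root.xpath(
    "//ns:div//*[local-name()='p' or local-name()='l'][@n='1']",
    namespaces=NSMAP)

# self::ns:p compares namespace and local name, so y:p is left out.
by_self = root.xpath(
    "//ns:div//*[self::ns:p or self::ns:l][@n='1']",
    namespaces=NSMAP)

print([el.tag for el in by_local_name])
print([el.tag for el in by_self])
```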
Both work very well, in case anyone else is, like me, unable to go down the /a/a | /a/b road.
Another solution would be to avoid XPath and use lxml/Python. You can iterate through your tree, maybe something like this:
from lxml import etree

NS = "{YOUR_NAMESPACE}"
tree = etree.parse("some.xml")
root = tree.getroot()
# all <div> elements in the namespace
alldivs = [i for i in root.iterdescendants("{}div".format(NS))]
# all <p> and <l> elements below any of those <div>s
all_l_p = [i for d in alldivs for i in d.iterdescendants()
           if i.tag in ("{}p".format(NS), "{}l".format(NS))]
Of course you need to filter the additional 'n' attribute too, but I hope the idea is clear.
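With the attribute filter added, the idea might look like the sketch below. The namespace URI and the inline document are assumptions made so the example runs on its own; in real use they would be replaced by your namespace and your parsed file.

```python
from lxml import etree

# Hypothetical namespace and document, for illustration only.
NS = "{urn:example:tei}"
DOC = (b'<TEI xmlns="urn:example:tei">'
       b'<div><p n="1"/><l n="1"/><p n="2"/><q n="1"/></div>'
       b'<p n="1"/>'
       b'</TEI>')
root = etree.fromstring(DOC)

# Fully qualified tag names we are interested in.
wanted = {NS + "p", NS + "l"}

# Walk every <div>, then every descendant of it, keeping <p>/<l> with n="1".
matches = [
    el
    for div in root.iterdescendants(NS + "div")
    for el in div.iterdescendants()
    if el.tag in wanted and el.get("n") == "1"
]

print([el.tag for el in matches])
```

One difference from the XPath versions: if a div is nested inside another div, its descendants are visited once per enclosing div, so the list can contain duplicates, whereas an XPath node-set never does.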
I expected this to be an order of magnitude or so slower. In practice timeit shows roughly 4x: 6 ms for my "best" version vs 25 ms for this.

--
Regards,
Ivan Pozdeev