[lxml-dev] absolute XPath expressions on Elements
Hi, I implemented a getpath() method for the ElementTree class that returns an XPath expression for a node. While working out test cases for it, however, I realized that the semantics of evaluating absolute XPath expressions (/...) on elements were not clear at all in the current implementation. ET does not allow absolute expressions in Element.findall() and raises a SyntaxError instead. I think we should do the same for Element.xpath() to prevent mistakes like this:
a = etree.Element("a")
b = etree.SubElement(a, "b")
d0 = etree.SubElement(b, "d")
c = etree.SubElement(a, "c")
d1 = etree.SubElement(c, "d")
d2 = etree.SubElement(c, "d")
c.xpath("//d")
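For comparison, the ElementTree behaviour referred to above can be reproduced with the standard library's xml.etree.ElementTree. This is only a minimal sketch and not lxml code; the helper try_findall is a hypothetical name introduced for the illustration.

```python
import xml.etree.ElementTree as ET

# Same tree as in the message, built with the standard library.
a = ET.Element("a")
b = ET.SubElement(a, "b")
d0 = ET.SubElement(b, "d")
c = ET.SubElement(a, "c")
d1 = ET.SubElement(c, "d")
d2 = ET.SubElement(c, "d")


def try_findall(element, path):
    """Return the matches, or the error text if the path is rejected."""
    try:
        return element.findall(path)
    except SyntaxError as exc:
        return "SyntaxError: %s" % exc


# Absolute path on an Element: rejected by the standard library.
absolute_result = try_findall(c, "//d")

# Relative path on the same Element: searches below c only.
relative_result = try_findall(c, ".//d")
```

Here absolute_result holds the error text, while relative_result holds the two "d" children of "c".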
The reasoning is that Elements do not have a root and therefore no absolute starting point for XPath. Only ElementTrees provide the required semantics, so it's perfectly valid to do this instead:
ElementTree(c).xpath("//d")
Imagine the case where you have many ElementTrees wrapping various elements in a tree. Which one should be the starting point? Remember that documents and their absolute root node are not exposed through the API. The use case that brought me there is this:
tree = etree.ElementTree(c)
print(tree.getpath(d2))  # prints: /c/d[2]
tree.xpath(tree.getpath(d2)) == [d2]  # fails!
Intuitively, this should work. However, the current implementation fails here, as it starts searching at 'a' rather than 'c' and thus finds nothing. To fix this, we have to switch the root node during XPath evaluation. Doing this for ElementTree.xpath() is ok, but doing this for Element.xpath() also is impossible, as it breaks relative expressions like "..".
So I decided to simply special case XPath expressions starting with '/' and raise exceptions for them. I know that this is not sufficient, as absolute paths can be hidden in things like "*[/a]" or "a|/a". But it's hopefully enough to make users aware and to prevent common mistakes. I also added a note in the documentation that the result of absolute expressions is undefined for Elements, I think that's the right way of saying it.
I post this here as there will likely be code that breaks because of this change. I already found two test cases in the test suite that used this. It's just too easy to get wrong, so lxml is better off by raising exceptions where it can than just ignoring this problem.
Stefan
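The limits of the prefix check described above can be sketched in a few lines. This is an illustration under the assumption that the check only inspects the start of the expression; starts_with_absolute_step is a hypothetical helper, not the code that went into lxml.

```python
def starts_with_absolute_step(expression):
    """Naive check: only looks at the first non-blank character."""
    return expression.lstrip().startswith("/")


# Expressions taken from the message above, plus two relative ones.
samples = ["//d", "/c/d[2]", "d", "..", "*[/a]", "a|/a"]

# Map each expression to whether the naive check would reject it.
flagged = {expr: starts_with_absolute_step(expr) for expr in samples}

# Expressions that contain an absolute path but slip through the check.
missed = [expr for expr in ("*[/a]", "a|/a") if not flagged[expr]]
```

Both "*[/a]" and "a|/a" end up in missed, which is the gap the message acknowledges.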
Stefan Behnel wrote:
I implemented a getpath() method for the ElementTree class that returns an XPath expression for a node. While working out test cases for it, however, I realized that the semantics of evaluating absolute XPath expressions (/...) on elements were not clear at all in the current implementation.
ET does not allow absolute expressions in Element.findall() and raises a SyntaxError instead. I think we should do the same for Element.xpath() to prevent mistakes like this:
a = etree.Element("a")
b = etree.SubElement(a, "b")
d0 = etree.SubElement(b, "d")
c = etree.SubElement(a, "c")
d1 = etree.SubElement(c, "d")
d2 = etree.SubElement(c, "d")
c.xpath("//d")
The reasoning is that Elements do not have a root and therefore no absolute starting point for XPath. Only ElementTrees provide the required semantics, so it's perfectly valid to do this instead:
ElementTree(c).xpath("//d")
Imagine the case where you have many ElementTrees wrapping various elements in a tree. Which one should be the starting point? Remember that documents and their absolute root node are not exposed through the API.
Maybe we should expose the absolute root of documents in the API? XPath is defined on the document level. We could define the xpath() function to work in the context of the underlying document when / is used. Conceptually for XPath there *is* an underlying document with a certain structure. We can try to paper that over with hacky XPath parsing and exceptions and pretend there is not, but it's going to lead to more confusion than just exposing this concept in the API.
The use case that brought me there is this:
tree = etree.ElementTree(c)
print(tree.getpath(d2))  # prints: /c/d[2]
tree.xpath(tree.getpath(d2)) == [d2]  # fails!
Intuitively, this should work. However, the current implementation fails here, as it starts searching at 'a' rather than 'c' and thus finds nothing. To fix this, we have to switch the root node during XPath evaluation. Doing this for ElementTree.xpath() is ok, but doing this for Element.xpath() also is impossible, as it breaks relative expressions like "..".
But c isn't the root of the tree in all this. I think again it would be much better if we exposed the real underlying tree here, and only return xpath expressions generated from the real root.
So I decided to simply special case XPath expressions starting with '/' and raise exceptions for them. I know that this is not sufficient, as absolute paths can be hidden in things like "*[/a]" or "a|/a". But it's hopefully enough to make users aware and to prevent common mistakes. I also added a note in the documentation that the result of absolute expressions is undefined for Elements, I think that's the right way of saying it.
I think that instead of going this way, we need to step back for a minute. libxml2 has documents with trees. ElementTree has, potentially, as many trees as there are nodes. xpath works on libxml2 documents. The libxml2 story is going to leak into the ElementTree abstraction inevitably - such as your expressions *[/a], and so on. I think instead of trying to protect the ElementTree abstraction by incomplete checks to prevent 'common mistakes', we need to rethink what we want to expose in the lxml abstractions in the first place.
I post this here as there will likely be code that breaks because of this change. I already found two test cases in the test suite that used this. It's just too easy to get wrong, so lxml is better off by raising exceptions where it can than just ignoring this problem.
I don't consider this code to be wrong. That's why we had cases in the test suite that tested for it. Since then you reworked the code to be more like ElementTree in the usage of the ElementTree class, but this stuff is going to shine through nonetheless.
Can't we expose a method getdocument() on Elements which will expose the underlying document as an ElementTree instance, and then define XPath's / to work from that always? We can then clearly define xpath() and getpath() in terms of getdocument().
Of course the behavior of getdocument() may be hard to predict for a user. Is this really true, or is getdocument() always going to be the thing created with Element() that wasn't appended or otherwise placed under another one? We have a getparent() method too after all, so we're hardly hiding the existence of the true libxml2 document in our abstraction.
Regards,
Martijn
Hi Martijn,
Martijn Faassen wrote:
Stefan Behnel wrote:
Imagine the case where you have many ElementTrees wrapping various elements in a tree. Which one should be the starting point? Remember that documents and their absolute root node are not exposed through the API.
Maybe we should expose the absolute root of documents in the API?
I don't think this helps. We have ElementTrees that already fulfil exactly the need of representing rooted XML trees. And having ElementTrees that are mostly like other ElementTrees except that they always reference a special element in the document that potentially is not in other ElementTrees of the same document but can be referenced by et.xpath() from any of them ... I don't think that makes things more understandable. You shouldn't forget that you can always append the context node of an ElementTree to another element. Is this supposed to change the result of an xpath() call on the unmodified ElementTree? This would introduce some really hard to debug side-effects. So, in a way, it introduces unpredictable behaviour either way. It's just that the transition gets it closer to the ElementTree API.
tree = etree.ElementTree(c)
print(tree.getpath(d2))  # prints: /c/d[2]
tree.xpath(tree.getpath(d2)) == [d2]  # fails!
Intuitively, this should work. However, the current implementation fails here, as it starts searching at 'a' rather than 'c' and thus finds nothing. To fix this, we have to switch the root node during XPath evaluation. Doing this for ElementTree.xpath() is ok, but doing this for Element.xpath() also is impossible, as it breaks relative expressions like "..".
But c isn't the root of the tree in all this.
Well, it is the root of the ElementTree object. When I call xpath() on that tree, I really expect the root of the tree to be the reference point for absolute expressions.
libxml2 has documents with trees. ElementTree has, potentially, as many trees as there are nodes. xpath works on libxml2 documents. The libxml2 story is going to leak into the ElementTree abstraction inevitably - such as your expressions *[/a], and so on.
But that expression only leaks on Elements. It works as expected on ElementTrees.
I think instead of trying to protect the ElementTree abstraction by incomplete checks to prevent 'common mistakes', we need to rethink what we want to expose in the lxml abstractions in the first place.
All I'm saying is that absolute expressions on Elements do not make sense anyway, so we should clearly mark them as invalid and do our best to prevent their use. If some of them leak, that's mainly for performance reasons.
I post this here as there will likely be code that breaks because of this change. I already found two test cases in the test suite that used this. It's just too easy to get wrong, so lxml is better off by raising exceptions where it can than just ignoring this problem.
I don't consider this code to be wrong. That's why we had cases in the test suite that tested for it. Since then you reworked the code to be more like ElementTree in the usage of the ElementTree class, but this stuff is going to shine through nonetheless.
Can't we expose a method getdocument() on Elements which will expose the underlying document as an ElementTree instance, and then define XPath's / to work from that always? We can then clearly define xpath() and getpath() in terms of getdocument().
Originally, I implemented getpath() as Element.getpath(). I revoked that because it doesn't make sense in the context of the ElementTree API. It only makes sense when you have an ElementTree that you refer to. So, now the call is ElementTree.getpath(element). I think it's the same for absolute expressions in xpath(). They just don't make sense on Elements.
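Why a path only has meaning relative to a chosen tree root can be sketched with the standard library. The function path_from_root below is a hypothetical illustration and not how lxml implements getpath(); it simply treats whichever element is passed as root as the node that "/" refers to.

```python
import xml.etree.ElementTree as ET


def path_from_root(root, target):
    """Build a location path for target, treating root as the tree root.

    Returns None if target is not inside the subtree under root.
    """
    # Standard-library elements have no parent pointer, so record the
    # parent of every element below root in one pass.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    if target is not root and target not in parent_of:
        return None
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [sib for sib in parent if sib.tag == node.tag]
        step = node.tag
        if len(same_tag) > 1:
            # 1-based position among siblings with the same tag
            step += "[%d]" % (same_tag.index(node) + 1)
        steps.append(step)
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


# Same tree as in the first message.
a = ET.Element("a")
b = ET.SubElement(a, "b")
d0 = ET.SubElement(b, "d")
c = ET.SubElement(a, "c")
d1 = ET.SubElement(c, "d")
d2 = ET.SubElement(c, "d")

path_in_c_tree = path_from_root(c, d2)  # "c" plays the role of the root
path_in_a_tree = path_from_root(a, d2)  # "a" plays the role of the root
outside_c_tree = path_from_root(c, d0)  # d0 is not below "c"
```

The same element gets a different path depending on the root it is measured from, so a path without a reference tree is ambiguous.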
Of course the behavior of getdocument() may be hard to predict for a user. Is this really true, or is getdocument() always going to be the thing created with Element() that wasn't appended or otherwise placed under another one? We have a getparent() method too after all, so we're hardly hiding the existence of the true libxml2 document in our abstraction.
We have getparent() because we do not allow Elements to have multiple parents. However, we do allow trees (or documents) to have multiple root contexts (via ElementTree). Everything in lxml works with ElementTrees by now and uses the correct context node when you pass one in. This includes XSLT, RelaxNG, XMLSchema and the XPath class. I don't see why xpath() should be the only exception.
Stefan
Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> writes:
I think it's the same for absolute expressions in xpath(). They just don't make sense on Elements.
Why? Please look a bit more closely at XPath expressions and what you can do with them. You have things like axes. You can search in many directions other than just towards children. To make the most of an XPath expression, it needs to have some context node AND some root. How can you give the context node to the XPath evaluation if the method is on the document side?
From my point of view the same xpath method needs to be able to evaluate both absolute and relative expressions.
Think about implementing something like XSLT, we define blocks that get a context node. Then from those blocks we can access the whole document both with absolute and relative expressions with the same method. It just needs to work and it just needs to know both the root and the context node.
-- Ilpo Nyyssönen # biny # /* :-) */
Hi Ilpo,
Ilpo Nyyssönen wrote:
Stefan Behnel writes:
I think it's the same for absolute expressions in xpath(). They just don't make sense on Elements.
Why? Please look a bit more closely at XPath expressions and what you can do with them. You have things like axes. You can search in many directions other than just towards children.
Sure, that's relative expressions, which are perfectly fine in the context of elements. If you read my post, you will see that this was one of my concerns.
To make the most of an XPath expression, it needs to have some context node AND some root. How can you give the context node to the XPath evaluation if the method is on the document side?
What do you mean? You either have a relative expression in which case you have a context node. Or it's an absolute expression in which case it does not have a context node. In the first case, call it either on an Element or ElementTree. In the second case, call it on an ElementTree.
From my point of view the same xpath method needs to be able to evaluate both absolute and relative expressions.
Then tell me: what does it mean to evaluate an absolute XPath expression against an element? What is the point in having a context node in that case? Can you come up with an absolute XPath expression that references a context node?
Think about implementing something like XSLT, we define blocks that get a context node. Then from those blocks we can access the whole document both with absolute and relative expressions with the same method.
It just needs to work and it just needs to know both the root and the context node.
But then why would you want to call the absolute expression on the context node? What's wrong with evaluating it against some ElementTree that represents the entire document? Sorry, I'm a little confused. Could you go into some more detail with your arguments?
Stefan
Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> writes:
What do you mean? You either have a relative expression in which case you have a context node. Or it's an absolute expression in which case it does not have a context node.
In the first case, call it either on an Element or ElementTree. In the second case, call it on an ElementTree.
So as I don't know whether it is relative or absolute (it was given to me by someone else via API), I need to evaluate it always in ElementTree? How does the ElementTree know the context node? Also, if I currently only pass Element to a method, where does it get the ElementTree? Or are you saying that I should pass both?
Then tell me: what does it mean to evaluate an absolute XPath expression against an element?
The same as it would be to evaluate it in the document the element belongs to.
What is the point in having a context node in that case? Can you come up with an absolute XPath expression that references a context node?
It is not about the expression using it. It is about a generic interface. I want to evaluate XPath expressions and I don't want to start looking whether those are relative or absolute.
-- Ilpo Nyyssönen # biny # /* :-) */
Hi Ilpo,
Ilpo Nyyssönen wrote:
Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> writes:
What do you mean? You either have a relative expression in which case you have a context node. Or it's an absolute expression in which case it does not have a context node.
In the first case, call it either on an Element or ElementTree. In the second case, call it on an ElementTree.
So as I don't know whether it is relative or absolute (it was given to me by someone else via API), I need to evaluate it always in ElementTree?
That was the idea, yes. I admit that it may be tricky to figure out the difference if you can't control the source of XPath expressions.
Then tell me: what does it mean to evaluate an absolute XPath expression against an element?
The same as it would be to evaluate it in the document the element belongs to.
That doesn't make sense in the ElementTree API. Elements do not have a root except for themselves.
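The standard library's xml.etree.ElementTree shows the same property in its path support: a search cannot reach above the element it was started on. The snippet below is a small sketch of that standard-library behaviour only; lxml elements differ here because they know their parent.

```python
import xml.etree.ElementTree as ET

a = ET.Element("a")
c = ET.SubElement(a, "c")
d1 = ET.SubElement(c, "d")

# Starting from c, the parent step cannot leave the subtree under c.
parents_from_c = c.findall("..")

# Starting from a, the same step can get from c back to a.
parents_from_a = a.findall("c/..")
```

The first search comes back empty, while the second finds "a", because the start element acts as the only root the search knows about.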
What is the point in having a context node in that case? Can you come up with an absolute XPath expression that references a context node?
It is not about the expression using it. It is about a generic interface. I want to evaluate XPath expressions and I don't want to start looking whether those are relative or absolute.
Ok, I get your point. Actually, it's already changed in the trunk. I implemented what Martijn proposed. We now have a "getroottree()" method on elements that returns an ElementTree for the root of the document that the element is in. We then define the evaluation of absolute expressions against elements as an evaluation against this ElementTree. This is a sensible extension to the API that makes sense in the context of lxml/libxml2.
Stefan
OK, and I need to thank you for changing the API back. Now you can add the getpath to the Element too?
-- Ilpo Nyyssönen # biny # /* :-) */
Hi Ilpo,
Ilpo Nyyssönen wrote:
OK, and I need to thank you for changing the API back. Now you can add the getpath to the Element too?
You can read the doc section describing XPath support here: http://codespeak.net/svn/lxml/trunk/doc/api.txt
Stefan
Hi Martijn,
Martijn Faassen wrote:
Can't we expose a method getdocument() on Elements which will expose the underlying document as an ElementTree instance
I thought about this some more. I'm not opposed to this idea. It makes sense in the context of libxml2. It's well defined and matches the getparent() method. I personally prefer a name like "getroottree()", as "document" is not used in the API so far.
, and then define XPath's / to work from that always? We can then clearly define xpath() and getpath() in terms of getdocument().
Not getpath(), which only works on ElementTrees anyway. This only regards Element.xpath() then. ElementTree.xpath() will continue to switch root nodes, whereas Element.xpath() will use the element as context for relative expressions and the root tree as context for absolute expressions.
Of course the behavior of getdocument() may be hard to predict for a user. Is this really true, or is getdocument() always going to be the thing created with Element() that wasn't appended or otherwise placed under another one?
"element.getroottree()" will always return an ElementTree rooted in the root node of the document that contains the element. How is that for a definition? Stefan
Stefan Behnel wrote:
Martijn Faassen wrote:
Can't we expose a method getdocument() on Elements which will expose the underlying document as an ElementTree instance
I thought about this some more. I'm not opposed to this idea. It makes sense in the context of libxml2. It's well defined and matches the getparent() method.
I personally prefer a name like "getroottree()", as "document" is not used in the API so far.
Heh, I was just checking this thread again and prepared to argue some more, but I'm glad to see I don't need to. :) Great!
, and then define XPath's / to work from that always? We can then clearly define xpath() and getpath() in terms of getdocument().
Not getpath(), which only works on ElementTrees anyway.
Right, that makes sense.
This only regards Element.xpath() then. ElementTree.xpath() will continue to switch root nodes, whereas Element.xpath() will use the element as context for relative expressions and the root tree as context for absolute expressions.
Understood.
Of course the behavior of getdocument() may be hard to predict for a user. Is this really true, or is getdocument() always going to be the thing created with Element() that wasn't appended or otherwise placed under another one?
"element.getroottree()" will always return an ElementTree rooted in the root node of the document that contains the element.
How is that for a definition?
Sounds fine. You could also define 'root node' a bit better by saying "it's what you get when you walk up the parent chain". I was thinking more along the lines of how complex it is for the programmer to reason about the tree this way. I think it's not too difficult to identify the root node. Of course, ElementTree proper doesn't have such a concept really, but we definitely do and it shows up in quite a few places.
Regards,
Martijn
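The "walk up the parent chain" definition, together with the rule agreed on in this thread for Element.xpath(), can be sketched with a toy model. Node, find_root and start_node are hypothetical names made up for this illustration, and the prefix test for "absolute" is a simplifying assumption; none of this is lxml's implementation.

```python
class Node:
    """Hypothetical stand-in for an element that knows its parent."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent

    def getparent(self):
        return self.parent


def find_root(node):
    """Walk up the parent chain until a node without a parent is reached."""
    while node.getparent() is not None:
        node = node.getparent()
    return node


def start_node(element, expression):
    """Pick the node an expression is evaluated from.

    Assumption: an expression counts as absolute when it starts with '/'.
    Absolute expressions start at the document root, relative ones at
    the element itself.
    """
    if expression.lstrip().startswith("/"):
        return find_root(element)
    return element


# A reduced version of the tree from the first message.
a = Node("a")
c = Node("c", parent=a)
d2 = Node("d", parent=c)
```

With this model, start_node(c, "//d") picks "a", while start_node(c, "d") and start_node(c, "..") both pick "c", so relative expressions keep their context node.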