[lxml-dev] Element.getnext() and Element.getprevious() ?

Hi all,
when I wrote the constant updater script, I noticed that navigating through an ElementTree to find the preceding sibling of an element is not trivial. However, that's not an unusual thing to do in HTML, where you might want to find a specific heading in the body, for example, and then look through the paragraphs belonging to the heading.
It's ok as long as you stick with ET and traverse the tree yourself to find the heading. However, if you find the heading with XPath, you're lost as you can't easily find out how the XML structure continues at the same level...
I'm therefore tempted to add the (trivially implemented) methods getnext() and getprevious() to Element, in the style of getparent(), getchildren() and gettreeroot(), but I wanted to ask here first if there are any objections to this extension. I think, we already have opened up the ET API towards a document based structure, so these would actually match the other extensions rather nicely.
Stefan
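To illustrate why sibling navigation is awkward without such methods, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree. The helpers get_next(), get_previous() and paragraphs_under() are hypothetical names made up for this example, not part of any library; they emulate what the proposed getnext() and getprevious() would do by first building a child-to-parent map.

```python
import xml.etree.ElementTree as ET

HTML = (
    "<body>"
    "<h1>Intro</h1><p>first</p><p>second</p>"
    "<h1>Details</h1><p>third</p>"
    "</body>"
)

root = ET.fromstring(HTML)

# Plain ElementTree elements do not know their parent, so build a
# child -> parent map once by walking the whole tree.
parent_of = {child: parent for parent in root.iter() for child in parent}


def get_next(element):
    """Hypothetical helper: the following sibling of element, or None."""
    parent = parent_of.get(element)
    if parent is None:
        return None
    siblings = list(parent)
    index = siblings.index(element)
    return siblings[index + 1] if index + 1 < len(siblings) else None


def get_previous(element):
    """Hypothetical helper: the preceding sibling of element, or None."""
    parent = parent_of.get(element)
    if parent is None:
        return None
    siblings = list(parent)
    index = siblings.index(element)
    return siblings[index - 1] if index > 0 else None


def paragraphs_under(heading):
    """Collect the text of the <p> siblings that directly follow a heading."""
    texts = []
    node = get_next(heading)
    while node is not None and node.tag == "p":
        texts.append(node.text)
        node = get_next(node)
    return texts


first_heading, second_heading = root.findall("h1")
print(paragraphs_under(first_heading))   # ['first', 'second']
print(paragraphs_under(second_heading))  # ['third']
```

The parent map and the list lookups are exactly the kind of bookkeeping that a built-in getnext() and getprevious() would make unnecessary.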

On Fri, 2006-06-02 at 13:11 +0200, Stefan Behnel wrote:
[snip]
It's better to think more about naming. What you are talking about is called "axes" in XPath ( http://www.w3.org/TR/xpath#axes ), and there are more axes than parent, children, and siblings.
I'd propose creating properties which act like lists. So the following would be correct:
node.following_sibling[0]
<some node bla-bla..>
There could be an exception for the parent node, as there can't be more than one parent.

Andrey Tatarinov wrote:
On Fri, 2006-06-02 at 13:11 +0200, Stefan Behnel wrote:
[snip]
It's better to think more on naming. What are you talking about is called "axes" in XPath ( http://www.w3.org/TR/xpath#axes ), and there are more than parent, children, and siblings axes.
I'd propose to create properties, which act like lists. So the following would be correct:
node.following_sibling[0] <some node bla-bla..>
there could be exception for parent node, as there couldn't be more than one parent.
I don't consider this to be easier to understand though. getnext() and getprevious() tend to be easier to grasp. I mean, I'm sure XPath axes are nice to use occasionally, and have some conceptual attraction, but if you want to use those, why not just use XPath?
Regards,
Martijn

On Fri, 2006-06-02 at 14:02 +0200, Martijn Faassen wrote:
Andrey Tatarinov wrote:
On Fri, 2006-06-02 at 13:11 +0200, Stefan Behnel wrote:
[snip]
It's better to think more on naming. What are you talking about is called "axes" in XPath ( http://www.w3.org/TR/xpath#axes ), and there are more than parent, children, and siblings axes.
I'd propose to create properties, which act like lists. So the following would be correct:
node.following_sibling[0] <some node bla-bla..>
there could be exception for parent node, as there couldn't be more than one parent.
I don't consider this to be easier to understand though. getnext() and getprevious() tend to be easier to grasp. I mean, I'm sure XPath axes are nice to use occasionally, and have some conceptual attraction, but if you want to use those, why not just use XPath?
There is such a thing as 'consistency'. As we are working in the domain of XML manipulation, and there is already a well-thought-out dictionary of terms and definitions and a well-thought-out language, we should adopt it as much as possible. It's like Occam's razor. (I know really well that consistent terminology is very important, because at the moment I'm working on a huge system that lacks it. Often it's really hard to understand what is meant by this or that word.) Thus introducing a new naming scheme (that is not thought through at all) is a bad thing. All that ElementTree/lxml is about is being a thin and intuitive wrapper of the XML domain for Python. I think that should not be forgotten.

Hi Andrey,
Andrey Tatarinov wrote:
Thus introducing new naming scheme (that is not thought through at all) is a bad thing.
I personally find element.getnext() and element.getprevious() *very* intuitive, given the already existing getparent(), getchildren() and getrootnode(). Maybe getiterator() is a bit less intuitive, as it doesn't tell you what it iterates over, but then again, when you know elements iterate over their own children, it becomes close-to-intuitive that getiterator() does the other thing, you know, iterate over the elements in the tree.
It's the same with element.getnext(). I'd just go: "Can't be children as there's getchildren() for that. Can't be the parent, as it wouldn't make sense and there's getparent() for that. So I guess it's the siblings. And getprevious() matches it, obviously." That's what I mean with intuitive.
Stefan

Hi Martijn,
Martijn Faassen wrote:
Andrey Tatarinov wrote:
node.following_sibling[0]
I don't consider this to be easier to understand though. getnext() and getprevious() tend to be easier to grasp.
I think so, too. Another thing would be "itersiblings()" to match iter(element), similar in naming to what the Python container classes (most notably dict) do. I think that would also make a nice companion. Something like:
def itersiblings(self, preceding=False): ...
to reach both directions.
Stefan
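A minimal sketch of how such an itersiblings() could behave, assuming a tiny stand-in Node class rather than lxml's real Element. The class, its attributes and the ordering of preceding siblings (nearest first) are assumptions made for illustration only.

```python
class Node:
    """Tiny stand-in for an element with parent and child links (hypothetical)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def _sibling(self, offset):
        # Look up the sibling at the given offset in the parent's child list.
        if self.parent is None:
            return None
        siblings = self.parent.children
        index = siblings.index(self) + offset
        return siblings[index] if 0 <= index < len(siblings) else None

    def getnext(self):
        return self._sibling(+1)

    def getprevious(self):
        return self._sibling(-1)

    def itersiblings(self, preceding=False):
        """Yield following siblings, or preceding ones (nearest first)."""
        node = self.getprevious() if preceding else self.getnext()
        while node is not None:
            yield node
            node = node.getprevious() if preceding else node.getnext()


body = Node("body")
a, b, c, d = (Node(tag, body) for tag in "abcd")

print([n.tag for n in b.itersiblings()])                # ['c', 'd']
print([n.tag for n in c.itersiblings(preceding=True)])  # ['b', 'a']
```

Because it is a generator, a caller that only needs the first match can stop early without visiting the remaining siblings.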

Stefan Behnel wrote:
Martijn Faassen wrote:
Andrey Tatarinov wrote:
node.following_sibling[0]
I don't consider this to be easier to understand though. getnext() and getprevious() tend to be easier to grasp.
I think so, too. Another thing would be "itersiblings()" to match iter(element), similar in naming to what the Python container classes (most notably dict) do. I think that would also make a nice companion. Something like:
def itersiblings(self, preceding=False): ...
Hmm, now that I think about it, we'd then also want iterparents(), right? But then it's really the question if we use iterparents() or iterancestors(). There is only one parent, but many siblings, so iterparents() is not really the right idea... I'll leave it out for now, until someone has a good argument for either of the two.
Stefan
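Whichever name is chosen, the behaviour under discussion amounts to repeated getparent() calls. A minimal sketch, again using a hypothetical stand-in Node class and the name iterancestors() purely for illustration:

```python
class Node:
    """Hypothetical minimal element that only remembers its parent."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self._parent = parent

    def getparent(self):
        return self._parent

    def iterancestors(self):
        """Yield the parent, then its parent, and so on up to the root."""
        ancestor = self.getparent()
        while ancestor is not None:
            yield ancestor
            ancestor = ancestor.getparent()


html = Node("html")
body = Node("body", html)
div = Node("div", body)
p = Node("p", div)

print([n.tag for n in p.iterancestors()])  # ['div', 'body', 'html']
```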

Hi Andrey,
Andrey Tatarinov wrote:
On Fri, 2006-06-02 at 13:11 +0200, Stefan Behnel wrote:
[snip]
It's better to think more on naming. What are you talking about is called "axes" in XPath ( http://www.w3.org/TR/xpath#axes ), and there are more than parent, children, and siblings axes.
I know. I thought about that, too. But I didn't want to add "sibling" to make it longer without making it clearer.
I'd propose to create properties, which act like lists. So the following would be correct:
node.following_sibling[0] <some node bla-bla..>
Ok, let's walk that through. Here are the other axes and their current API:
* ancestor - subsequent calls to getparent()
* child - element[i] or getchildren()
* descendant - getiterator()
* following - ?
* following-sibling - ?
* parent - getparent()
* preceding - ?
* preceding-sibling - ?
So all that's currently missing is really the sibling stuff. However, your above proposal would also encourage an ancestor 'list'. Note also that the preceding axis is rather tricky (and rarely used IMHO), so it's rather unlikely it will make it into the API. The following axis, on the other hand, can be seen as a combination of getnext() and getiterator(), so that's covered by adding a getnext().
I'm a bit opposed to the list idea, as it is not very explicit. Just for performance, how would you distinguish between these two from the point of view of the property itself:
element.following_sibling
element.following_sibling[0]
Should we always build a list of all siblings for both cases? Also, it doesn't match the getchildren() API call (which /is/ explicit).
So, if we follow the axis naming exactly, all that is really missing is getfollowingsibling() and getprecedingsibling(). Now, those two are rather hard to read, but getfollowing() and getpreceding() are just wrong in terms of XPath. So, I still prefer getnext() and getprevious().
Stefan
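The remark that the following axis is a combination of getnext() and getiterator() can be sketched as below. The Node class is a hypothetical stand-in, and following() is an illustrative helper that is not part of any API: it climbs from the start node through its ancestors and, at each level, walks the following siblings together with their subtrees.

```python
class Node:
    """Hypothetical stand-in element with parent and child links."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def getparent(self):
        return self.parent

    def getnext(self):
        if self.parent is None:
            return None
        siblings = self.parent.children
        index = siblings.index(self) + 1
        return siblings[index] if index < len(siblings) else None

    def getiterator(self):
        """Yield this node and all of its descendants in document order."""
        yield self
        for child in self.children:
            yield from child.getiterator()


def following(node):
    """Sketch of the XPath 'following' axis built from getnext() and getiterator()."""
    current = node
    while current is not None:
        sibling = current.getnext()
        while sibling is not None:
            yield from sibling.getiterator()
            sibling = sibling.getnext()
        current = current.getparent()


root = Node("root")
a = Node("a", root)
a1 = Node("a1", a)
a2 = Node("a2", a)
b = Node("b", root)
b1 = Node("b1", b)
c = Node("c", root)

print([n.tag for n in following(a1)])  # ['a2', 'b', 'b1', 'c']
```

Note that descendants of the start node are not visited, which matches the definition of the following axis.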

On Fri, 2006-06-02 at 14:17 +0200, Stefan Behnel wrote:
Hi Andrey,
Andrey Tatarinov wrote:
On Fri, 2006-06-02 at 13:11 +0200, Stefan Behnel wrote:
[snip]
It's better to think more on naming. What are you talking about is called "axes" in XPath ( http://www.w3.org/TR/xpath#axes ), and there are more than parent, children, and siblings axes.
I know. I thought about that, too. But I didn't want to add "sibling" to make it longer without making it clearer.
I'd propose to create properties, which act like lists. So the following would be correct:
node.following_sibling[0] <some node bla-bla..>
Ok, let's walk that through. Here are the other axes and their current API:
* ancestor - subsequent calls to getparent()
* child - element[i] or getchildren()
* descendant - getiterator()
* following - ?
* following-sibling - ?
* parent - getparent()
* preceding - ?
* preceding-sibling - ?
sorry, but it's a mess
So all that's currently missing is really the sibling stuff. However, your above proposal would also encourage an ancestor 'list'. Note also that the preceding axis is rather tricky (and rarely used IMHO), so it's rather unlikely it will make it into the API. The following axis, on the other hand, can be seen as a combination of getnext() and getiterator(), so that's covered by adding a getnext().
I'm a bit opposed to the list idea, as it is not very explicit. Just for performance, how would you distinguish between these two from the point of view of the property itself:
element.following_sibling
element.following_sibling[0]
Should we always build a list of all siblings for both cases? Also, it doesn't match the getchildren() API call (which /is/ explicit).
A list and a list-like object are different things, in case you care about performance. Explicit means using a well-known, interoperable interface as much as possible (file-like objects are a great example); this doesn't mean using the _exact_ class for a task, but using the _exact_ interface. Of course it's a little bit less explicit than using a list .children, which could be the only container and the means to access contained nodes, but things are already not that way, so it doesn't count.
So, if we follow the axis naming exactly, all that is really missing is getfollowingsibling() and getprecedingsibling(). Now, those two are rather hard to read, but getfollowing() and getpreceding() are just wrong in terms of XPath. So, I still prefer getnext() and getprevious().
I thought a little more about it; that wouldn't hurt much, I suppose: at the moment lxml is a bloat of different approaches and inconsistent APIs, so adding just a little bit more of it is nothing.
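The distinction between a list and a list-like object can be made concrete with a lazy view. The sketch below is only an illustration under stated assumptions: the Node class with a plain next pointer, the FollowingSiblings class and the link() helper are all hypothetical. Accessing the property costs nothing, and indexing walks only as far as the requested item.

```python
from collections.abc import Sequence


class FollowingSiblings(Sequence):
    """Lazy, read-only view over the siblings after a node (hypothetical)."""

    def __init__(self, start):
        self._start = start

    def __iter__(self):
        node = self._start.next
        while node is not None:
            yield node
            node = node.next

    def __len__(self):
        return sum(1 for _ in self)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return list(self)[index]
        if index < 0:
            index += len(self)
        # Walk only as far as the requested position.
        for position, node in enumerate(self):
            if position == index:
                return node
        raise IndexError("sibling index out of range")


class Node:
    """Hypothetical element that stores a pointer to its next sibling."""

    def __init__(self, tag):
        self.tag = tag
        self.next = None

    @property
    def following_sibling(self):
        # Creating the view does not traverse anything.
        return FollowingSiblings(self)


def link(*nodes):
    """Chain the given nodes as consecutive siblings."""
    for left, right in zip(nodes, nodes[1:]):
        left.next = right


a, b, c = Node("a"), Node("b"), Node("c")
link(a, b, c)

print(a.following_sibling[0].tag)              # b
print([n.tag for n in a.following_sibling])    # ['b', 'c']
```

So element.following_sibling and element.following_sibling[0] need not both build a full list, although len() and negative indices still require a complete walk in this sketch.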

Hi Andrey,
Andrey Tatarinov wrote:
at the moment lxml is a bloat of different approaches and inconsistent api's, so adding just a little bit more of it is nothing.
Ah, finally, that's good news. Anything specific in your mind that you might want to change regarding the current API?
Stefan :)

On Fri, 2006-06-02 at 15:07 +0200, Stefan Behnel wrote:
Hi Andrey,
Andrey Tatarinov wrote:
at the moment lxml is a bloat of different approaches and inconsistent api's, so adding just a little bit more of it is nothing.
Ah, finally, that's good news. Anything specific in your mind that you might want to change regarding the current API?
That's a question for more than just the 10 minutes I can afford at the moment.
The obvious ones:
- element's .xpath, .getiterator
- xslt result's .__str__
I hope I won't forget to make a deeper examination and write it up for the ML sometime.

Hi Andrey,
Andrey Tatarinov wrote:
On Fri, 2006-06-02 at 15:07 +0200, Stefan Behnel wrote:
Hi Andrey,
Andrey Tatarinov wrote:
at the moment lxml is a bloat of different approaches and inconsistent api's, so adding just a little bit more of it is nothing.
Ah, finally, that's good news. Anything specific in your mind that you might want to change regarding the current API?
That's a question for more than just 10 minutes which I can afford at the moment.
The obvious ones:
- element's .xpath, .getiterator
Uhm, I guess you mean .xpath() and .findall() here, .getiterator() does something different. Sure, the path expressions accepted by both are different.
- xslt result's .__str__
How's that inconsistent? All this is saying is "I know how to become a string", which is right away true for XSLT results (but not for arbitrary trees, if that's what you're comparing to).
I'm actually happy you didn't come up with anything important in your first shot. That makes me confident for your future criticism.
Stefan

Andrey Tatarinov wrote: [snip]
Ok, let's walk that through. Here are the other axes and their current API:
* ancestor - subsequent calls to getparent()
* child - element[i] or getchildren()
* descendant - getiterator()
* following - ?
* following-sibling - ?
* parent - getparent()
* preceding - ?
* preceding-sibling - ?
sorry, but it's a mess
It's possible to express XPath axes in terms of simple operations like this. Here's an example of code (from Forest, an attempt at an XML database):
    self.selfAxis = lambda nodes: nodes
    self.childAxis = Concat(Map(doc.firstChild), TransitiveClosure([doc.nextSibling]))
    self.parentAxis = Concat(TransitiveClosure([doc.nextSiblingInverse]), Map(doc.firstChildInverse))
    self.descendantAxis = Concat(Map(doc.firstChild), TransitiveClosure([doc.firstChild, doc.nextSibling]))
    self.ancestorAxis = Concat(
        TransitiveClosure([doc.firstChildInverse, doc.nextSiblingInverse]),
        Map(doc.firstChildInverse))
    self.descendantOrSelfAxis = AxisUnion(self.descendantAxis, self.selfAxis)
    self.ancestorOrSelfAxis = AxisUnion(self.ancestorAxis, self.selfAxis)
    self.followingAxis = Concat(
        Concat(Concat(self.ancestorOrSelfAxis, Map(doc.nextSibling)),
               TransitiveClosure([doc.nextSibling])),
        self.descendantOrSelfAxis)
    self.precedingAxis = Concat(
        Concat(Concat(self.ancestorOrSelfAxis, Map(doc.nextSiblingInverse)),
               TransitiveClosure([doc.nextSiblingInverse])),
        self.descendantOrSelfAxis)
    self.followingSiblingAxis = Concat(
        Map(doc.nextSibling), TransitiveClosure([doc.nextSibling]))
    self.precedingSiblingAxis = Concat(
        Map(doc.nextSiblingInverse), TransitiveClosure([doc.nextSiblingInverse]))
    self.attributeAxis = Concat(Map(doc.firstAttribute), TransitiveClosure([doc.nextAttribute]))
https://infrae.com/viewvc/old/forest/trunk/src/forest/axes.py
And also an example of some higher order functional programming in Python. :)
As you can see, to define all the axes you only need firstChild (element[0]), nextSibling (getnext()), firstChildInverse (getparent()) and nextSiblingInverse (getprevious()), except for the attribute axis.
Of course, as far as I'm aware, XPath is defined to walk over a tree with text nodes present so I'm not sure whether this is all relevant at all. Anyway, the XPath database model is not the be-all and end-all of XML tree navigation. DOM for instance defines things much like getparent(), getnext() and so on.
Regards,
Martijn
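For readers who want to experiment with the combinator idea, here is a rough sketch of how Map, TransitiveClosure and Concat might be written. This is not the Forest implementation: the semantics are guessed from how the axes above are composed (TransitiveClosure is assumed to mean "zero or more steps"), the Node class and the snake_case primitives are hypothetical, and the result order is not guaranteed to be document order.

```python
class Node:
    """Hypothetical stand-in element with parent and child links."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def first_child(node):
    return node.children[0] if node.children else None


def next_sibling(node):
    if node.parent is None:
        return None
    siblings = node.parent.children
    index = siblings.index(node) + 1
    return siblings[index] if index < len(siblings) else None


def next_sibling_inverse(node):
    if node.parent is None:
        return None
    siblings = node.parent.children
    index = siblings.index(node) - 1
    return siblings[index] if index >= 0 else None


def first_child_inverse(node):
    # Only defined for a first child: it maps back to the parent.
    parent = node.parent
    if parent is not None and parent.children[0] is node:
        return parent
    return None


def Map(step):
    """Apply one step to every node, dropping missing results and duplicates."""
    def axis(nodes):
        result = []
        for node in nodes:
            target = step(node)
            if target is not None and target not in result:
                result.append(target)
        return result
    return axis


def TransitiveClosure(steps):
    """Assumed semantics: zero or more applications of any of the steps."""
    def axis(nodes):
        result = list(nodes)
        pending = list(nodes)
        while pending:
            node = pending.pop()
            for step in steps:
                target = step(node)
                if target is not None and target not in result:
                    result.append(target)
                    pending.append(target)
        return result
    return axis


def Concat(first, second):
    """Compose two axes: apply first, then second."""
    return lambda nodes: second(first(nodes))


child_axis = Concat(Map(first_child), TransitiveClosure([next_sibling]))
parent_axis = Concat(TransitiveClosure([next_sibling_inverse]), Map(first_child_inverse))
descendant_axis = Concat(Map(first_child), TransitiveClosure([first_child, next_sibling]))
following_sibling_axis = Concat(Map(next_sibling), TransitiveClosure([next_sibling]))

root = Node("root")
a = Node("a", root)
a1 = Node("a1", a)
a2 = Node("a2", a)
b = Node("b", root)
c = Node("c", root)

print(sorted(n.tag for n in descendant_axis([root])))  # ['a', 'a1', 'a2', 'b', 'c']
print([n.tag for n in parent_axis([a2])])               # ['a']
```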
participants (3)
- Andrey Tatarinov
- Martijn Faassen
- Stefan Behnel