options for reverse iteration

Hi, I was wondering what options lxml has, if any, for iterating over the descendants of a tree in reverse document order. For example, it would be nice if the API had a signature similar to iter(), e.g. iterreverse(self, *tags). If not, are there any recipes to do this efficiently? I know, for example, that lxml has methods to iterate over the children of an element in reverse order, but not the descendants from what I can see. Thanks, --Chris

Chris Jerdonek schrieb am 20.10.20 um 22:38:
True. That's missing. I mean, you can always collect the descendants in a list and reverse it, i.e. for element in reversed(list(root.iter())): ... But there is no direct support for reversed iteration. It's certainly a rare use case that I have never run into myself. Could you give me an idea of what you need it for? If it was to be added, the usage should be for element in reversed(root.iter()): ... with direct support for "__reversed__()" by the iterator, which would return a different iterator here that runs in the opposite direction. That's not entirely trivial to implement since the current implementation is very specialised and geared towards maximum performance. Looking at the inline C macros in etree_defs.h that implement the traversal, it looks like adding a flag and duplicating "_LX__ADVANCE_TO_NEXT" might be enough at that level, with new top-level macros "BEGIN_REVERSED_FOR_EACH_FROM" etc. https://github.com/lxml/lxml/blob/b055581bf4492de6da7678fbe7404b0232da6d84/s... Then, I'd add the same flag to the ElementDepthFirstIterator, together with a "__reversed__(self)" method that creates a new instance with the flag inverted. https://github.com/lxml/lxml/blob/b083124281d824eb861ff58e7276a5c1f1d8c18d/s... Thinking of that, maybe it's not entirely that easy, since it would have to start at the current node. But that can be done, too, I'm sure. Probably just copying over the configuration of the current iterator. PR welcome. :) Stefan

On Tue, Oct 20, 2020 at 10:57 PM Stefan Behnel <stefan_ml@behnel.de> wrote:
Thanks for your thoughts. It's possible there might be a way around needing it for the use case I have in mind. But to describe the general idea, think of an operation where you want to recursively remove leaves from the tree. But whether you remove a leaf is a condition that depends on the siblings after it. If you iterate forwards and you wind up removing a leaf, you may have to go backwards and revisit a previous element (e.g. a previous sibling), because the evaluation of the condition may now have changed for the previous element. But if you iterate backwards and you wind up removing a leaf, you won't need to go backwards in your iteration (meaning forwards in document order) because the condition will change only for leaves that you haven't visited yet in your backwards iteration. For the use case I described, it won't exactly be iteration, but rather repeated applications of the function "get the first element satisfying a certain condition going backwards," which is equivalent to having a way of iterating backwards. --Chris

Chris Jerdonek schrieb am 21.10.20 um 09:00:
This sounds like a use case that is so document specific that a recursive scheme might work well. You don't have to use lxml's tree iterators, you can also just iterate over siblings yourself and go up and down the hierarchy. The tree iterator is fast, especially if you have something specific to look for. But the more involved the operations afterwards are, the less important the iteration speed becomes. Then the question is how convenient the tree iterator still is if you have to do your own local traversals and state keeping anyway. Also note that deep tree iteration isn't efficient if you are only (or mostly) interested in elements that live close to the root element. In that case, sibling iteration might be faster the deeper or larger the document becomes. I would suggest to use iteration (maybe .iterfind() with a concrete path) for finding a good series of elements to start at, and then walk your own way from those. Consider injecting a 1-element read-ahead generator if you need to modify the tree during iteration. I've added a recipe for the read-ahead: https://github.com/lxml/lxml/commit/c053dc159c7f0a6a98922c937a0baede7ce7af9d Stefan

On Wed, Oct 21, 2020 at 2:20 AM Stefan Behnel <stefan_ml@behnel.de> wrote:
Yes, I'm very familiar with this approach. :)
All good to know..
Yes, such a wrapper generator has also occurred to me for similar uses (but just the 1-element read-ahead which I don't think requires additional imports), though I haven't had need to use it yet. It's good to see the idea / recipe for the more general case. Thank you! --Chris

On Wed, Oct 21, 2020 at 1:15 PM Stefan Behnel <stefan_ml@behnel.de> wrote:
Here's another approach for the 1-ahead case.. Does this look any simpler than what you first had? def read_ahead(items): iterator = itertools.chain(items, [0]) prev = next(iterator) for x in iterator: yield prev prev = x --Chris

Chris Jerdonek schrieb am 20.10.20 um 22:38:
True. That's missing. I mean, you can always collect the descendants in a list and reverse it, i.e. for element in reversed(list(root.iter())): ... But there is no direct support for reversed iteration. It's certainly a rare use case that I have never run into myself. Could you give me an idea of what you need it for? If it was to be added, the usage should be for element in reversed(root.iter()): ... with direct support for "__reversed__()" by the iterator, which would return a different iterator here that runs in the opposite direction. That's not entirely trivial to implement since the current implementation is very specialised and geared towards maximum performance. Looking at the inline C macros in etree_defs.h that implement the traversal, it looks like adding a flag and duplicating "_LX__ADVANCE_TO_NEXT" might be enough at that level, with new top-level macros "BEGIN_REVERSED_FOR_EACH_FROM" etc. https://github.com/lxml/lxml/blob/b055581bf4492de6da7678fbe7404b0232da6d84/s... Then, I'd add the same flag to the ElementDepthFirstIterator, together with a "__reversed__(self)" method that creates a new instance with the flag inverted. https://github.com/lxml/lxml/blob/b083124281d824eb861ff58e7276a5c1f1d8c18d/s... Thinking of that, maybe it's not entirely that easy, since it would have to start at the current node. But that can be done, too, I'm sure. Probably just copying over the configuration of the current iterator. PR welcome. :) Stefan

On Tue, Oct 20, 2020 at 10:57 PM Stefan Behnel <stefan_ml@behnel.de> wrote:
Thanks for your thoughts. It's possible there might be a way around needing it for the use case I have in mind. But to describe the general idea, think of an operation where you want to recursively remove leaves from the tree. But whether you remove a leaf is a condition that depends on the siblings after it. If you iterate forwards and you wind up removing a leaf, you may have to go backwards and revisit a previous element (e.g. a previous sibling), because the evaluation of the condition may now have changed for the previous element. But if you iterate backwards and you wind up removing a leaf, you won't need to go backwards in your iteration (meaning forwards in document order) because the condition will change only for leaves that you haven't visited yet in your backwards iteration. For the use case I described, it won't exactly be iteration, but rather repeated applications of the function "get the first element satisfying a certain condition going backwards," which is equivalent to having a way of iterating backwards. --Chris

Chris Jerdonek schrieb am 21.10.20 um 09:00:
This sounds like a use case that is so document specific that a recursive scheme might work well. You don't have to use lxml's tree iterators, you can also just iterate over siblings yourself and go up and down the hierarchy. The tree iterator is fast, especially if you have something specific to look for. But the more involved the operations afterwards are, the less important the iteration speed becomes. Then the question is how convenient the tree iterator still is if you have to do your own local traversals and state keeping anyway. Also note that deep tree iteration isn't efficient if you are only (or mostly) interested in elements that live close to the root element. In that case, sibling iteration might be faster the deeper or larger the document becomes. I would suggest to use iteration (maybe .iterfind() with a concrete path) for finding a good series of elements to start at, and then walk your own way from those. Consider injecting a 1-element read-ahead generator if you need to modify the tree during iteration. I've added a recipe for the read-ahead: https://github.com/lxml/lxml/commit/c053dc159c7f0a6a98922c937a0baede7ce7af9d Stefan

On Wed, Oct 21, 2020 at 2:20 AM Stefan Behnel <stefan_ml@behnel.de> wrote:
Yes, I'm very familiar with this approach. :)
All good to know..
Yes, such a wrapper generator has also occurred to me for similar uses (but just the 1-element read-ahead which I don't think requires additional imports), though I haven't had need to use it yet. It's good to see the idea / recipe for the more general case. Thank you! --Chris

On Wed, Oct 21, 2020 at 1:15 PM Stefan Behnel <stefan_ml@behnel.de> wrote:
Here's another approach for the 1-ahead case.. Does this look any simpler than what you first had? def read_ahead(items): iterator = itertools.chain(items, [0]) prev = next(iterator) for x in iterator: yield prev prev = x --Chris
participants (2)
-
Chris Jerdonek
-
Stefan Behnel