Element iteration and removal
Hello,

Suppose I iterate over the elements of a subtree. Is it safe to remove the elements during the iteration?

    for elem in t.xpath("br"):
        p = elem.getparent()
        p.remove(elem)  # _Seems_ to work?

Or will that mess with the iterator, so that I should collect the elements first?

    for elem in list(t.xpath("br")):
        elem.getparent().remove(elem)  # now remove

I guess I can rephrase the question: is xpath() lazy?

Just want to make sure. Thanks!
Jens

--
Jens Tröger
http://savage.light-speed.de/
On Wed, Apr 4, 2018 at 10:55 PM, Jens Tröger <jens.troeger@light-speed.de> wrote:
> Hello,
>
> Suppose I iterate over the elements of a subtree, then is it safe to remove the elements during the iteration:
>
>     for elem in t.xpath("br"):
>         p = elem.getparent()
>         p.remove(elem)  # _Seems_ to work?
>
> or will that mess with the iterator, and I should collect the elements first:
>
>     for elem in list(t.xpath("br")):
>         elem.getparent().remove(elem)  # now remove
I've encountered situations where mutating the tree while iterating over it messes up the iterator (not with xpath(), but with some of the other standard iterators exposed by lxml). So I now always do the latter.

I would *love* to know if there are simple conditions in which mutating the tree while iterating is guaranteed to be safe. I've never found anything in the docs saying that it's okay to mutate a tree while iterating over it, and I've looked on and off for such.

By the way, you might be interested in the skip_subtree() method of iterwalk() that lets you skip over an element's subtree on a case-by-case basis. That can be used to reduce the size of the list you're building in the latter / "collecting first" approach.

--Chris
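As an illustration of the skip_subtree() suggestion, here is a minimal sketch of the "collect first, then remove" approach built on iterwalk(). The sample document and the rule "leave everything inside <pre> alone" are assumptions made up for this example, not something from the thread.

```python
from lxml import etree

# Hypothetical sample document, for illustration only.
root = etree.fromstring(
    "<div><p><br/></p><pre><br/><br/></pre><p><br/></p></div>")

to_remove = []
walker = etree.iterwalk(root, events=("start", "end"))
for event, elem in walker:
    if event != "start":
        continue
    if elem.tag == "pre":
        # Do not descend into <pre>; its children are never visited,
        # so they never end up in the collected list.
        walker.skip_subtree()
    elif elem.tag == "br":
        to_remove.append(elem)

# The walk is finished, so modifying the tree cannot divert it.
for elem in to_remove:
    elem.getparent().remove(elem)

parents_of_remaining = [br.getparent().tag for br in root.iter("br")]
print(len(to_remove), parents_of_remaining)
```

Only the <br> elements outside <pre> are collected, so the list stays as small as the traversal rule allows.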
Chris Jerdonek wrote on 05.04.2018 at 09:01:
> On Wed, Apr 4, 2018 at 10:55 PM, Jens Tröger wrote:
>> Suppose I iterate over the elements of a subtree, then is it safe to remove the elements during the iteration:
>>
>>     for elem in t.xpath("br"):
>>         p = elem.getparent()
>>         p.remove(elem)  # _Seems_ to work?
The XPath implementation always returns the complete final result, i.e. a list of elements in this case. (And you can see that by printing the result.)
>> or will that mess with the iterator, and I should collect the elements first:
>>
>>     for elem in list(t.xpath("br")):
>>         elem.getparent().remove(elem)  # now remove
>
> I've encountered situations where mutating the tree while iterating over it messes up the iterator (not with xpath(), but with some of the other standard iterators exposed by lxml). So I now always do the latter.
>
> I would *love* to know if there are simple conditions in which mutating the tree while iterating is guaranteed to be safe. I've never found anything in the docs saying that it's okay to mutate a tree while iterating over it, and I've looked on and off for such.
Basic rule: if you change the structure of or "behind" the element that was last returned by the iterator, this can divert the iterator. Anything that the iterator has already stepped over is safe.

"Behind" refers to the next element(s) that the iterator has to touch. That's not 100% clear to predict, as the internal implementation might look ahead or could decide to optimise some cases and touch somewhat more or other nodes than you might expect. Also, the fact that text nodes are part of the internal tree structure but not visible to users makes the boundary conditions less clear, even though the iterator will always sit and wait on element(-ish) nodes and never on text nodes. But at least the direction in which the iterator looks is obvious from the iteration order.

I checked and you're right, it's only documented for iterparse(), not for the tree iteration methods.

http://lxml.de/1.3/parsing.html#modifying-the-tree
> By the way, you might be interested in the skip_subtree() method of iterwalk() that lets you skip over an element's subtree on a case-by-case basis. That can be used to reduce the size of the list you're building in the latter / "collecting first" approach.

Yes, iterwalk() really feels like the ultimate tool for anything you need in terms of document order iteration now. Even for complex traversal cases, you can easily wrap it in a generator function to build a tightly adapted iterator.

Similarly, you can always make local modifications during iteration safe by wrapping the tree iterator with a little generator that looks one element ahead, and thus makes sure that the tree iterator has already passed the position which you are modifying.

Stefan
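The look-ahead wrapper described above could look roughly like the following sketch. The helper name one_ahead() and the sample document are hypothetical, chosen for illustration; this is one possible reading of the idea, not code from lxml itself.

```python
from lxml import etree


def one_ahead(iterator):
    """Yield each item only after the next one has been fetched."""
    iterator = iter(iterator)
    try:
        current = next(iterator)
    except StopIteration:
        return  # nothing to iterate over
    for upcoming in iterator:
        # The wrapped iterator has already moved on to `upcoming`,
        # so the caller may now modify or remove `current`.
        yield current
        current = upcoming
    yield current


# Hypothetical sample document, for illustration only.
root = etree.fromstring(
    "<div><p><br/><b>x</b><br/></p><p><br/></p></div>")

removed = 0
for elem in one_ahead(root.iter("br")):
    elem.getparent().remove(elem)
    removed += 1

remaining = root.findall(".//br")
print(removed, len(remaining))
```

The wrapper only covers changes to the element it just yielded. Changes to the element it has already fetched as the next one, or to anything after that, can still divert the underlying iterator.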
Hi,
> Suppose I iterate over the elements of a subtree, then is it safe to remove the elements during the iteration:
>
>     for elem in t.xpath("br"):
>         p = elem.getparent()
>         p.remove(elem)  # _Seems_ to work?
>
> or will that mess with the iterator, and I should collect the elements first:
>
>     for elem in list(t.xpath("br")):
>         elem.getparent().remove(elem)  # now remove
>
> I guess I can rephrase the question: is xpath() lazy?
xpath() returns a list, not an iterator (well, depending on the XPath expression, see http://lxml.de/xpathxslt.html#xpath "XPath return values").

Holger
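This is easy to check on a small tree. The following is a minimal sketch with a made-up sample document: it shows that a node-set expression produces a fully built Python list, that a count() expression produces a float instead, and that removing the matched elements while looping over the list empties the tree of them.

```python
from lxml import etree

# Hypothetical sample document, for illustration only.
root = etree.fromstring("<p><br/><b>bold</b><br/><br/></p>")

result = root.xpath("br")        # node-set expression
count = root.xpath("count(br)")  # numeric expression
print(type(result).__name__, len(result), count)

# `result` is already complete, so this loop iterates over a plain
# list and not over the tree that is being modified.
for elem in result:
    elem.getparent().remove(elem)

remaining = [child.tag for child in root]
print(remaining)
```

Since the result is already a list, wrapping it in list() as in the second variant of the question makes a copy but does not change the behaviour.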
participants (4)
- Chris Jerdonek
- Holger Joukl
- Jens Tröger
- Stefan Behnel