starting root.iter(*tags) midway through
Hi,

I have a seemingly simple lxml.etree use case, but the API doesn't seem to support it.

Say I have an Element "root" at the root of a tree, and say I have an element "element" inside the tree. Is there an efficient way to get the element **after** "element" (in document order), and matching given tags?

It would suffice to be able to specify the starting point for the iterator `root.iter(*tags)` -- in other words to start it midway through. Is there a way to do that that doesn't involve, say, first iterating through the first elements of `root.iter(*tags)`?

Thanks,
--Chris
On .11.2017 at 14:55, Chris Jerdonek <chris.jerdonek@gmail.com> wrote:
It would suffice to be able to specify the starting point for the iterator `root.iter(*tags)` -- in other words to start it midway through. Is there a way to do that that doesn't involve, say, first iterating through the first elements of `root.iter(*tags)`?
Depends on how you define efficiency… You can always use XPath to access specific elements, or just `find(fully_qualified_tagname)`. However, if it's a large file, you might want to keep memory use as low as possible and use the event-based `iterparse()` approach.

lxml makes `etree.iterparse()` more convenient by letting you pass in the tags you're interested in.

Charlie
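To make the XPath suggestion concrete for an already-parsed tree, here is a minimal sketch using the standard XPath 1.0 `following::` and `descendant::` axes. The helper name `next_match_xpath`, the sample document, and the hard-coded tags "a" and "b" are illustrative assumptions, not something from the thread; namespaced tags would also need a prefix mapping. Whether this is faster than plain iteration on a large tree is not something this sketch establishes.

```python
from lxml import etree


def next_match_xpath(element, include_descendants=True):
    """Return the first <a> or <b> after `element` in document order, or None.

    Hypothetical helper: the tag names are hard-coded for illustration.
    """
    if include_descendants:
        # Descendants come after the element's start tag in document order.
        expr = "(descendant::* | following::*)[self::a or self::b][1]"
    else:
        # following:: starts after the element's end tag, so it skips descendants.
        expr = "following::*[self::a or self::b][1]"
    found = element.xpath(expr)
    return found[0] if found else None


root = etree.XML('<root><a id="1"><b id="2"/></a><c/><b id="3"/></root>')
start = root[0]
inside = next_match_xpath(start)
outside = next_match_xpath(start, include_descendants=False)
```

The `include_descendants` switch reflects the two possible readings of "after": strictly after the start tag, or after the whole subtree.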
On Thu, Nov 23, 2017 at 6:18 AM, Charlie Clark <charlie.clark@clark-consulting.eu> wrote:
Depends on how you define efficiency… You can always use XPath to access specific elements or just `find(fully_qualified_tagname)`. However, if it's a large file you might want to keep memory use as low as possible and use the event-based `iterparse()` approach.
Thanks, Charlie. But in my question you can assume the document is already parsed and the starting element is already accessed. So the question is: does lxml provide a way to find the element after that in the tree?

It would be nice to be able to do something like this (hypothetical syntax; iterators can't actually be sliced by element):

`next(root.iter(*tags)[element:])  # first element strictly after "element"`

--Chris
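For reference, the baseline the question is trying to avoid can be written as follows. This is only a sketch: `first_match_after` is a hypothetical helper name, it assumes plain (non-wildcard) tag names, and it walks the tree from the root, so its cost grows with the position of "element" in the document.

```python
from itertools import dropwhile

from lxml import etree


def first_match_after(root, element, *tags):
    """Return the first element with one of `tags` strictly after `element`.

    "After" means document order, so descendants of `element` count.
    Returns None if there is no such element.
    """
    # Walk all nodes (not only the wanted tags) so that `element` is seen
    # even when its own tag is not among `tags`.
    remaining = dropwhile(lambda node: node is not element, root.iter())
    next(remaining, None)  # drop `element` itself
    for node in remaining:
        if node.tag in tags:
            return node
    return None


root = etree.XML('<root><a id="1"><b id="2"/></a><c/><b id="3"/></root>')
after_c = first_match_after(root, root[1], "a", "b")
after_a = first_match_after(root, root[0], "a", "b")
```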
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Chris Jerdonek wrote on 23.11.2017 at 14:55:
I have a seemingly simple lxml.etree use case, but the API doesn't seem to support it.
Say I have an Element "root" at the root of a tree, and say I have an element "element" inside the tree. Is there an efficient way to get the element **after** "element" (in document order), and matching given tags?
Your use case isn't clear to me. How do you get at that element? Could you provide some more details? Stefan
On Fri, Nov 24, 2017 at 12:42 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Your use case isn't clear to me. How do you get at that element? Could you provide some more details?
Thanks. I'll provide more details.

Given a tree, I want to perform some processing on every subtree in the tree having certain properties. The properties are (1) that the root element of the subtree has one of a number of tags, say "a" and "b", and (2) that the root doesn't have any ancestors with those tags. So these can be thought of as the "maximal" subtrees with a root having tag "a" or "b".

It's straightforward to get the first subtree using iter():

`element = next(root.iter('a', 'b'))`

However, to get the next subtree to process, it's not as straightforward. The approach I'm using is first to use lxml's API to get the element in the tree that follows the subtree (if it exists). Call that element "next_element". This next element doesn't necessarily satisfy the property I'm looking for, so here is where my question comes in. What I'm looking for is the first element in root.iter('a', 'b') that is equal to or after next_element. This is why it would be useful to be able to "start" iterating over root.iter('a', 'b') from an arbitrary starting element.

What I'm suggesting / asking for is a bit like adding a "start" argument to str.find() if it initially only supported searching from start index 0: https://docs.python.org/3/library/stdtypes.html#str.find

The alternative solution I've come up with doesn't seem as efficient or elegant.

Thank you,
--Chris
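The approach described in this message might look roughly like the sketch below. It is not the poster's actual code: the helper names `element_after_subtree` and `maximal_subtrees_manual` are invented for illustration, and the sketch relies on lxml returning the same Python object for the same node while a reference to it is held (used for the `is not root` check). It uses only `getnext()`, `getparent()` and `iter()`.

```python
from lxml import etree


def element_after_subtree(element, root):
    """Return the first element after `element`'s subtree, or None.

    The search never leaves the tree rooted at `root`.
    """
    node = element
    while node is not None and node is not root:
        sibling = node.getnext()
        # getnext() also returns comments and processing instructions,
        # whose .tag is not a string; skip those.
        while sibling is not None and not isinstance(sibling.tag, str):
            sibling = sibling.getnext()
        if sibling is not None:
            return sibling
        node = node.getparent()  # no sibling left: continue one level up
    return None


def maximal_subtrees_manual(root, *tags):
    """Yield matching elements that have no matching ancestor."""
    node = root
    while node is not None:
        # First match at or below `node`, i.e. "equal to or after" it.
        match = next(node.iter(*tags), None)
        if match is None:
            node = element_after_subtree(node, root)
        else:
            yield match
            node = element_after_subtree(match, root)


doc = etree.XML("<root><x><a><b/></a></x><!-- note --><c><b><a/></b></c><b/></root>")
tree = doc.getroottree()
manual_paths = [tree.getpath(el) for el in maximal_subtrees_manual(doc, "a", "b")]
```

Restarting `iter()` at each "next_element" is what stands in for the missing "start" argument here.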
Chris Jerdonek wrote on 24.11.2017 at 10:31:
Given a tree, I want to perform some processing on every subtree in the tree having certain properties. The properties are (1) that the root element of the subtree is one of a number of tags, say "a" and "b", and (2) the root doesn't have any ancestors with those tags. So these can be thought of as the "maximal" subtrees with a root having tag "a" or "b".
This sounds like iterwalk() might solve your problem. http://lxml.de/parsing.html#iterparse-and-iterwalk Stefan
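A minimal sketch of how iterwalk() could be applied to the "maximal subtrees" problem, assuming an lxml version that provides `skip_subtree()` on the iterwalk object. The generator name `maximal_subtrees` and the sample document are illustrative assumptions.

```python
from lxml import etree


def maximal_subtrees(root, tags):
    """Yield elements with one of `tags` that have no ancestor with those tags."""
    walker = etree.iterwalk(root, events=("start",), tag=tags)
    for _event, element in walker:
        # Do not descend into the matched element, so nested matches
        # are never reported.
        walker.skip_subtree()
        yield element


doc = etree.XML("<root><x><a><b/></a></x><c><b><a/></b></c><b/></root>")
tree = doc.getroottree()
walk_paths = [tree.getpath(el) for el in maximal_subtrees(doc, ("a", "b"))]
```

Listening for "start" events matters here: skip_subtree() has no effect after an "end" event, because the subtree has already been traversed by then.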
Thanks. I didn't know about iterwalk() with its skip_subtree(). I'll give it a shot.

--Chris

On Fri, Nov 24, 2017 at 5:29 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
This sounds like iterwalk() might solve your problem.
http://lxml.de/parsing.html#iterparse-and-iterwalk
Hi Stefan,

Am I right that there is a bit of asymmetry in the way that multiple tags are handled by iterwalk(), as compared to, say, methods like _Element.iter()?

Whereas _Element.iter()'s signature is--

`iter(self, tag=None, *tags)`

iterwalk()'s signature is--

`iterwalk(self, element_or_tree, events=("end",), tag=None)`

It doesn't seem to be documented that iterwalk() accepts more than one tag, but empirically I found that a tuple of tags can be passed as the tag argument to iterwalk(). Is that intentional?

Thanks,
--Chris

On Fri, Nov 24, 2017 at 9:08 PM, Chris Jerdonek <chris.jerdonek@gmail.com> wrote:
Thanks. I didn't know about iterwalk() with its skip_subtree(). I'll give it a shot.
--Chris
On 26 November 2017 at 08:08:49 CET, Chris Jerdonek wrote:
It doesn't seem to be documented that iterwalk() accepts more than one tag, but empirically I found that a tuple of tags can be passed as the tag argument to iterwalk(). Is that intentional?
Sort of. There used to be only the "tag" argument, which accepted one name. When adding support for multiple names, I didn't want to change the interfaces everywhere by renaming it to "tags", so I left it as it is and allowed tuples. iter() is special in that it accepts positional arguments, because it's such a widely used and exposed utility, and because it didn't conflict with its original interface.

Historical reasons, as always in these cases.

Documentation improvements welcome.

Stefan
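The two calling conventions discussed here can be put side by side. This is a small sketch with an invented sample document: `iter()` takes the tags as separate positional arguments, while `iterwalk()` takes them as a single tuple via `tag`.

```python
from lxml import etree

root = etree.XML('<root><a id="1"><b id="2"/></a><c id="3"/><b id="4"/></root>')

# iter(): several tags as positional arguments
ids_from_iter = [el.get("id") for el in root.iter("a", "b")]

# iterwalk(): several tags as one tuple passed to the `tag` keyword
ids_from_iterwalk = [
    el.get("id")
    for _event, el in etree.iterwalk(root, events=("start",), tag=("a", "b"))
]
```

With "start" events, iterwalk() reports elements in document order, so both lists should contain the same ids in the same order.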
On Sat, Nov 25, 2017 at 11:55 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Sort of. There used to be only the "tag" argument, which accepted one name. When adding support for multiple names, I didn't want to change the interfaces everywhere by renaming it to "tags", so I left it as it is and allowed tuples. iter() is special in that it accepts positional arguments, because it's such a widely used and exposed utility, and because it didn't conflict with its original interface.
Historical reasons, as always in these cases.
Documentation improvements welcome.
Thanks for the explanation. I posted a small improvement here: https://github.com/lxml/lxml/pull/257 --Chris
participants (3)
- Charlie Clark
- Chris Jerdonek
- Stefan Behnel