parsing nested unbounded XML fields with ElementTree

Stefan Behnel stefan_ml at behnel.de
Tue Nov 26 08:38:13 CET 2013


Larry.Martell... at gmail.com, 25.11.2013 23:22:
> I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:
> 
> <Node Name="A">
>    <Node Name="B">
>       <Node Name="C">
>         <Node Name="D">
>           <Node Name="E">
> 
> When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:
> 
> nodes = []
> 
> def parseChild(c):
>     if c.tag == 'Node':
>         if 'Name' in c.attrib: 
>             nodes.append(c.attrib['Name'])
>         for c1 in c:
>             parseChild(c1)
>     else:
>         for node in nodes:
>             print node,
>         print c.tag
> 
> for parent in tree.getiterator():
>     for child in parent:
>         for x in child:
>             parseChild(x)

This seems hugely redundant. tree.getiterator() already returns a recursive
iterable, and then, for each node in your document, you are running
recursively over its entire subtree. That means you'll visit each node
about as many times as its depth in the tree.


> My problem is that I don't know when I'm done with a node and I should
> remove a level of nesting. I would think this is a fairly common
> situation, but I could not find any examples of parsing a file like
> this. Perhaps I'm going about it completely wrong.

Your recursive traversal function tells you when you're done. If you drop
the getiterator() bit, reaching the end of parseChild() means that you're
done with the element and start backing up. So you can simply pass down a
list of element names that you append() at the beginning of the function
and pop() at the end, i.e. a stack. That list will then always give you the
current path from the root node.
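
As a rough sketch of that stack idea (not a drop-in replacement for the code
above): the sample document, the "Item" element and the names walk() and
collect_paths() are made up for illustration, and it assumes Python 3 and
that every Node carries a Name attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; "Item" stands in for any non-Node element.
XML = """\
<Node Name="A">
  <Node Name="B">
    <Node Name="C">
      <Item/>
    </Node>
  </Node>
  <Node Name="D">
    <Item/>
  </Node>
</Node>
"""

def walk(element, path, found):
    """Visit element recursively; path is the stack of enclosing Node names."""
    is_node = element.tag == 'Node'
    if is_node:
        # Entering a Node: push its name (empty string if Name is missing).
        path.append(element.get('Name', ''))
    else:
        # Any other element: record the current Node path and the tag.
        found.append(('/'.join(path), element.tag))
    for child in element:
        walk(child, path, found)
    if is_node:
        # Reaching the end of the function means this Node is done.
        path.pop()

def collect_paths(root):
    """Return a list of (node_path, tag) pairs for all non-Node elements."""
    found = []
    walk(root, [], found)
    return found

root = ET.fromstring(XML)
for node_path, tag in collect_paths(root):
    print(node_path, tag)
```

Each element is visited exactly once here, and the pop() at the end of walk()
is what removes a level of nesting.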

Alternatively, if you want to use lxml.etree instead of ElementTree, you
can use its iterwalk() function, which gives you the same thing but
without recursion, as a plain iterator.

http://lxml.de/parsing.html#iterparse-and-iterwalk
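
To illustrate the event-based style without requiring lxml, here is a sketch
that uses ElementTree's own iterparse() with "start" and "end" events. Note
that this is a stand-in, not iterwalk() itself: iterparse() produces the
events while parsing, whereas lxml's iterwalk(root, events=("start", "end"))
produces the same kind of (event, element) pairs from an already parsed tree.
The sample data and the name iter_paths() are invented for the example, and
Python 3 is assumed.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; "Item" stands in for any non-Node element.
SAMPLE = (b'<Node Name="A">'
          b'<Node Name="B"><Item/></Node>'
          b'<Node Name="C"><Item/></Node>'
          b'</Node>')

def iter_paths(source):
    """Yield (node_path, tag) for each non-Node element, without recursion."""
    path = []
    for event, element in ET.iterparse(source, events=('start', 'end')):
        if element.tag == 'Node':
            if event == 'start':
                # Attributes are already available on the "start" event.
                path.append(element.get('Name', ''))
            else:
                # "end" event: this Node is finished, drop one level.
                path.pop()
        elif event == 'start':
            yield '/'.join(path), element.tag

for node_path, tag in iter_paths(io.BytesIO(SAMPLE)):
    print(node_path, tag)
```

The "end" event plays the role that returning from the recursive function
plays in the stack-based approach.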

Stefan




