lxml: traverse xml tree and retrieve element based on an attribute

Thu May 21 18:57:00 EDT 2009

byron wrote:
> I am using the lxml.etree library to validate an xml instance file
> with a specified schema that contains the data types of each element.
> This is some of the internals of a function that extracts the
> elements:
> 
>         schema_doc = etree.parse(schema_fn)
>         schema = etree.XMLSchema(schema_doc)
> 
>         context = etree.iterparse(xml_fn, events=('start', 'end'),
>                                   schema=schema)
> 
>         # get root
>         event, root = context.next()
> 
>         for event, elem in context:
>             if event == 'end' and elem.tag == self.tag:
>                 yield elem
>             root.clear()
> 
> I retrieve a list of elements from this... and do further processing
> to represent them in different ways. I need to be able to capture the
> data type from the schema definition for each field in the element.
> i.e.
> 
>     <xsd:element name="concept">
>         <xsd:complexType>
>             <xsd:sequence>
>                 <xsd:element ref="foo"/>
>                 <xsd:element name="concept_id" type="xsd:string"/>
>                 <xsd:element name="line" type="xsd:integer"/>
>                 <xsd:element name="concept_value" type="xsd:string"/>
>                 <xsd:element ref="some_date"/>
>             </xsd:sequence>
>         </xsd:complexType>
>     </xsd:element>
> 
> My thought is to recursively traverse the schema definition, match
> on the `name` attribute (since each name is unique to a `type`), and
> return that element. But I can't seem to make it quite work. All the
> xml is valid, validation works, etc. This is what I have:
> 
>     def find_node(tree, name):
>         for c in tree:
>             if c.attrib.get('name') == name:
>                 return c
>             if len(c) > 0:
>                 return find_node(c, name)
>         return 0
> 
You're returning the result of searching the first child that has
children of its own, whether or not anything was found there. If the
match isn't in that subtree, you need to go on and search the next
child, so only return when the recursive call actually found something:

     def find_node(tree, name):
         for c in tree:
             if c.attrib.get('name') == name:
                 return c
             if len(c) > 0:
                 r = find_node(c, name)
                 # Compare with None: an element without children is falsy
                 if r is not None:
                     return r
         return None
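
If you'd rather not write the recursion yourself, the element's iter()
method already walks the whole subtree in document order. Here is a
minimal sketch of that idea. It uses the standard library's
xml.etree.ElementTree so it runs anywhere (lxml.etree offers the same
iter() method); the SCHEMA string is a made-up sample modelled on your
snippet. Note that iter() also visits `tree` itself, which the recursive
version does not.

```python
import xml.etree.ElementTree as etree  # lxml.etree has the same iter() API

# Hypothetical sample schema, modelled on the fragment quoted above.
SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="concept">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="foo"/>
                <xsd:element name="concept_id" type="xsd:string"/>
                <xsd:element name="line" type="xsd:integer"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>
"""


def find_node(tree, name):
    """Return the first element in document order whose `name` attribute
    equals `name`, or None if there is no such element."""
    for node in tree.iter():  # visits tree itself, then every descendant
        if node.get('name') == name:
            return node
    return None


root = etree.fromstring(SCHEMA)
node = find_node(root, 'line')
print(node.get('type'))  # xsd:integer
```

The caller should test the result with `is not None` rather than plain
truthiness, because a matching leaf element has no children and so
evaluates as false.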

> I may have been staring at this too long, but when something is
> returned... it should be returned completely, no? This is what occurs
> with `return find_node(c, name)` if it returns 0. `return c` works
> (used pdb to verify that), but the recursion continues and ends up
> returning 0.
> 
> Thoughts and/or a different approach are welcome. Thanks
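
As for a different approach: since you need the type of every field, it
may be simpler to walk the schema once and build a name-to-type
dictionary, instead of searching the tree for each field. The following
is only a sketch under some assumptions: build_type_map and SCHEMA are
names I made up, it only picks up xsd:element declarations that carry
both a `name` and a `type` attribute (so `ref` elements and inline
complex types are skipped), and it relies on names being unique, as you
say they are. It uses the standard library's ElementTree; the same
iter() call should work on an lxml tree.

```python
import xml.etree.ElementTree as etree  # lxml.etree has the same iter() API

XSD_NS = 'http://www.w3.org/2001/XMLSchema'

# Hypothetical sample schema, modelled on the fragment quoted above.
SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="concept">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element ref="foo"/>
                <xsd:element name="concept_id" type="xsd:string"/>
                <xsd:element name="line" type="xsd:integer"/>
                <xsd:element name="concept_value" type="xsd:string"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>
"""


def build_type_map(schema_root):
    """Map each named xsd:element that declares a type to that type."""
    types = {}
    # Passing a qualified tag to iter() yields only xsd:element nodes.
    for node in schema_root.iter('{%s}element' % XSD_NS):
        name = node.get('name')
        type_ = node.get('type')
        if name is not None and type_ is not None:
            types[name] = type_
    return types


type_map = build_type_map(etree.fromstring(SCHEMA))
print(type_map['line'])  # xsd:integer
```

Looking up a field's type is then a plain dictionary access on the
element's tag while you process the instance document.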