lxml: traverse xml tree and retrieve element based on an attribute
MRAB
google at mrabarnett.plus.com
Thu May 21 18:57:00 EDT 2009
byron wrote:
> I am using the lxml.etree library to validate an xml instance file
> with a specified schema that contains the data types of each element.
> This is some of the internals of a function that extracts the
> elements:
>
> schema_doc = etree.parse(schema_fn)
> schema = etree.XMLSchema(schema_doc)
>
> context = etree.iterparse(xml_fn, events=('start', 'end'),
>                           schema=schema)
>
> # get root
> event, root = context.next()
>
> for event, elem in context:
>     if event == 'end' and elem.tag == self.tag:
>         yield elem
>         root.clear()
>
> I retrieve a list of elements from this... and do further processing
> to represent them in different ways. I need to be able to capture the
> data type from the schema definition for each field in the element.
> i.e.
>
> <xsd:element name="concept">
>   <xsd:complexType>
>     <xsd:sequence>
>       <xsd:element ref="foo"/>
>       <xsd:element name="concept_id" type="xsd:string"/>
>       <xsd:element name="line" type="xsd:integer"/>
>       <xsd:element name="concept_value" type="xsd:string"/>
>       <xsd:element ref="some_date"/>
>     </xsd:sequence>
>   </xsd:complexType>
> </xsd:element>
>
> My thought is to recursively traverse the schema definition, match
> the `name` attribute (since names are unique to a `type`), and
> return that element. But I can't seem to make it quite work. All the
> xml is valid, validation works, etc. This is what I have:
>
> def find_node(tree, name):
>     for c in tree:
>         if c.attrib.get('name') == name:
>             return c
>         if len(c) > 0:
>             return find_node(c, name)
>     return 0
>
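You're searching the first child that has children of its own and then
returning that result unconditionally, but what you're looking for might
not be in that child; if it's not, you need to go on and search the next
child. Compare the result with None rather than testing its truth value,
because an lxml element with no children is false in a boolean context:

    def find_node(tree, name):
        for c in tree:
            if c.attrib.get('name') == name:
                return c
            if len(c) > 0:
                r = find_node(c, name)
                if r is not None:
                    return r
        return None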
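
If you'd rather not write the recursion yourself, here is a minimal sketch of the same search using iter(), which walks every descendant in document order. To keep it self-contained it uses the standard library's xml.etree.ElementTree as a stand-in; lxml.etree provides the same fromstring(), iter() and get() calls. The XSD string is just your fragment wrapped in an xsd:schema root that I added for illustration.

```python
import xml.etree.ElementTree as etree  # stand-in; lxml.etree offers the same calls used here

# The fragment from the question, wrapped in an assumed xsd:schema root.
XSD = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="concept">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="foo"/>
        <xsd:element name="concept_id" type="xsd:string"/>
        <xsd:element name="line" type="xsd:integer"/>
        <xsd:element name="concept_value" type="xsd:string"/>
        <xsd:element ref="some_date"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
"""


def find_node(tree, name):
    """Return the first descendant of `tree` whose `name` attribute
    equals `name`, or None if there is no such element."""
    for elem in tree.iter():
        # iter() yields `tree` itself first; skip it so that only
        # descendants are searched, as in the recursive version.
        if elem is not tree and elem.get('name') == name:
            return elem
    return None


root = etree.fromstring(XSD)
node = find_node(root, 'line')
print(node.get('type'))  # xsd:integer
```

Check the result with "is None" / "is not None" here too, for the same reason as above.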
> I may have been staring at this too long, but when something is
> returned... it should be returned completely, no? This is what occurs
> with `return find_node(c, name)` if it returns 0. `return c` works
> (used pdb to verify that), but the recursion continues and ends up
> returning 0.
>
> Thoughts and/or a different approach are welcome. Thanks