[Tutor] Trouble Parsing XML using lxml

Mon Apr 6 08:49:36 CEST 2009

marc at marcd.org wrote:
> I am trying to parse a structure that looks like:
> 
> {urn:FindingImport}TOOL - GD
> {urn:FindingImport}TOOL_VERSION - 2.0.8.8
> {urn:FindingImport}AUTHENTICATED_FINDING - TRUE
> {urn:FindingImport}GD_VUL_NAME - Rename Built-in Guest Account
> {urn:FindingImport}GD_SEVERITY - 2
> {urn:FindingImport}FINDING - None
> {urn:FindingImport}FINDING_ID - V0001115
> {urn:FindingImport}FINDING_STATUS - NF
> {urn:FindingImport}TOOL - GD
> {urn:FindingImport}TOOL_VERSION - 2.0.8.8
> {urn:FindingImport}AUTHENTICATED_FINDING - TRUE
> {urn:FindingImport}GD_VUL_NAME - Rename Built-in Administrator Account
> {urn:FindingImport}GD_SEVERITY - 2
> {urn:FindingImport}FINDING - None
> {urn:FindingImport}FINDING_ID - V0001117
> 
> This is the result when the original data is run through 'for element in
> root.iter():' as described in the lxml tutorial.

Note that this does not give you the "structure" (i.e. the hierarchy of
elements) but only the plain elements in document order. XML is a tree
structure that has elements at the same level and child-parent
relationships between elements at different hierarchy levels.
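
A minimal sketch of how to make the hierarchy visible. The sample document is
an assumption: the original file is not shown in this thread, so the IMPORT
root element and the nesting under TOOL are guesses, and walk() is only an
illustrative helper name. It prints each element indented by its depth:

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls used here
    import xml.etree.ElementTree as etree

# Hypothetical layout: the IMPORT root and the nesting under TOOL are guesses.
SAMPLE = """\
<IMPORT xmlns="urn:FindingImport">
  <TOOL>
    <TOOL_VERSION>2.0.8.8</TOOL_VERSION>
    <FINDING_ID>V0001115</FINDING_ID>
  </TOOL>
  <TOOL>
    <TOOL_VERSION>2.0.8.8</TOOL_VERSION>
    <FINDING_ID>V0001117</FINDING_ID>
  </TOOL>
</IMPORT>
"""


def walk(element, depth=0):
    """Yield (depth, tag, text) for an element and all its descendants."""
    yield depth, element.tag, (element.text or "").strip()
    for child in element:  # iterating an element yields its direct children
        for item in walk(child, depth + 1):
            yield item


root = etree.fromstring(SAMPLE)
for depth, tag, text in walk(root):
    # Indent by depth so parent-child relationships become visible.
    print("%s%s - %s" % ("    " * depth, tag, text))
```

If all elements came out at the same depth, they would be siblings, and
.getnext() rather than .find() would be the way to step from one to the next.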

> This structure repeats
> many times in the document with different values after each tag.  I want
> to take the values and place them in one csv line for each structure in
> the file.  The closest I have come is something like (but doesn't work):
> 
>     for element in root.iter("{urn:FindingImport}TOOL"):
>         print element.text
>         print element.getnext().text
>         print element.getnext().text
> 
> The initial print element.tag and the first element.getnext().text work as
> I would like, but I am not finding a way to parse past that.  The second
> element.getnext().text returns the value for the same tag as the one prior
> to it.

.getnext() returns the next sibling of the element, not its child, and
calling it twice on the same element returns the same sibling both times.
I assume that "TOOL" is the top-level element of the repeating subtree that
you want to extract here. In that case, you can use e.g.

	element.find("{urn:FindingImport}GD_VUL_NAME")

to retrieve the subelement named 'GD_VUL_NAME', or

	element.findtext("{urn:FindingImport}GD_VUL_NAME")

to retrieve its text content directly.
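
Put together, this could look roughly like the sketch below. It relies on the
same assumption as above, namely that TOOL is the parent of the other
elements. The IMPORT root element and the choice of three columns are made up
for illustration. The 'default' argument of findtext() covers records where
an element is missing:

```python
try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls used here
    import xml.etree.ElementTree as etree

NS = "{urn:FindingImport}"

# Hypothetical layout: assumes TOOL wraps the other elements of a record.
SAMPLE = """\
<IMPORT xmlns="urn:FindingImport">
  <TOOL>
    <GD_VUL_NAME>Rename Built-in Guest Account</GD_VUL_NAME>
    <FINDING_ID>V0001115</FINDING_ID>
    <FINDING_STATUS>NF</FINDING_STATUS>
  </TOOL>
  <TOOL>
    <GD_VUL_NAME>Rename Built-in Administrator Account</GD_VUL_NAME>
    <FINDING_ID>V0001117</FINDING_ID>
  </TOOL>
</IMPORT>
"""

root = etree.fromstring(SAMPLE)
rows = []
for tool in root.iter(NS + "TOOL"):
    # find() returns the child element itself (or None if there is none).
    name_element = tool.find(NS + "GD_VUL_NAME")
    # findtext() returns the child's text content directly.
    finding_id = tool.findtext(NS + "FINDING_ID")
    # The second record has no FINDING_STATUS, so the default is used.
    status = tool.findtext(NS + "FINDING_STATUS", default="")
    rows.append((name_element.text, finding_id, status))

for row in rows:
    print(",".join(row))
```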

You should also take a look at lxml.objectify, which provides a very handy
way to deal with the kind of XML that you have here. It will allow you to
do this:

     for tool in root.iter("{urn:FindingImport}TOOL"):
         print tool.GD_VUL_NAME, tool.FINDING

BTW, if all you want is to map the XML to CSV, without any major
restructuring in between, take a look at iterparse(). It works a lot like
the .iter() method, but iterates during parsing, which allows you to delete
subtrees after use to save memory.
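
A rough sketch of the iterparse() approach, again assuming that TOOL wraps
each record. The IMPORT root, the FIELDS list and the findings_to_csv() helper
are hypothetical names chosen for this example, and the in-memory BytesIO
stands in for the real input file:

```python
import csv
import io

try:
    from lxml import etree
except ImportError:  # the standard library offers the same calls used here
    import xml.etree.ElementTree as etree

NS = "{urn:FindingImport}"
FIELDS = ["TOOL_VERSION", "GD_VUL_NAME", "FINDING_ID"]  # illustrative columns

# Hypothetical layout: assumes TOOL wraps the other elements of a record.
SAMPLE = b"""\
<IMPORT xmlns="urn:FindingImport">
  <TOOL>
    <TOOL_VERSION>2.0.8.8</TOOL_VERSION>
    <GD_VUL_NAME>Rename Built-in Guest Account</GD_VUL_NAME>
    <FINDING_ID>V0001115</FINDING_ID>
  </TOOL>
  <TOOL>
    <TOOL_VERSION>2.0.8.8</TOOL_VERSION>
    <GD_VUL_NAME>Rename Built-in Administrator Account</GD_VUL_NAME>
    <FINDING_ID>V0001117</FINDING_ID>
  </TOOL>
</IMPORT>
"""


def findings_to_csv(source, out):
    """Write one CSV line per TOOL subtree while the input is being parsed."""
    writer = csv.writer(out)
    # An "end" event fires once an element and all its children are parsed.
    for event, element in etree.iterparse(source, events=("end",)):
        if element.tag == NS + "TOOL":
            writer.writerow(
                [element.findtext(NS + name, default="") for name in FIELDS]
            )
            # Drop the children and text of the finished record so that
            # their memory can be reclaimed.
            element.clear()


output = io.StringIO()
findings_to_csv(io.BytesIO(SAMPLE), output)
print(output.getvalue())
```

For a real file, the BytesIO object would be replaced by a file opened in
binary mode, and 'out' by a file opened for writing text.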

Stefan