[Tutor] Trouble Parsing XML using lxml
Stefan Behnel
stefan_ml at behnel.de
Mon Apr 6 08:49:36 CEST 2009
marc at marcd.org wrote:
> I am trying to parse a structure that looks like:
>
> {urn:FindingImport}TOOL - GD
> {urn:FindingImport}TOOL_VERSION - 2.0.8.8
> {urn:FindingImport}AUTHENTICATED_FINDING - TRUE
> {urn:FindingImport}GD_VUL_NAME - Rename Built-in Guest Account
> {urn:FindingImport}GD_SEVERITY - 2
> {urn:FindingImport}FINDING - None
> {urn:FindingImport}FINDING_ID - V0001115
> {urn:FindingImport}FINDING_STATUS - NF
> {urn:FindingImport}TOOL - GD
> {urn:FindingImport}TOOL_VERSION - 2.0.8.8
> {urn:FindingImport}AUTHENTICATED_FINDING - TRUE
> {urn:FindingImport}GD_VUL_NAME - Rename Built-in Administrator Account
> {urn:FindingImport}GD_SEVERITY - 2
> {urn:FindingImport}FINDING - None
> {urn:FindingImport}FINDING_ID - V0001117
>
> This is the result when the original data is run through 'for element in
> root.iter():' as described in the lxml tutorial.
Note that this does not give you the "structure" (i.e. the hierarchy of
elements) but only the plain elements in document order. XML is a tree
structure that has elements at the same level and child-parent
relationships between elements at different hierarchy levels.
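To make the difference concrete, here is a minimal sketch. The sample document is made up: the root name IMPORT_FILE and the nesting of the fields under TOOL are assumptions, not taken from the original data. The import falls back to the standard library's ElementTree, which offers the same calls used here. The `outline` helper is a hypothetical name.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same API for this sketch
    import xml.etree.ElementTree as etree

NS = "{urn:FindingImport}"

# Hypothetical sample; the real document layout is not shown in the thread.
SAMPLE = b"""<IMPORT_FILE xmlns="urn:FindingImport">
  <TOOL>
    <GD_VUL_NAME>Rename Built-in Guest Account</GD_VUL_NAME>
    <FINDING_ID>V0001115</FINDING_ID>
  </TOOL>
  <TOOL>
    <GD_VUL_NAME>Rename Built-in Administrator Account</GD_VUL_NAME>
    <FINDING_ID>V0001117</FINDING_ID>
  </TOOL>
</IMPORT_FILE>"""

def outline(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    lines = ["  " * depth + element.tag.replace(NS, "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = etree.fromstring(SAMPLE)

# .iter() walks the tree in document order and hides the nesting ...
flat = [el.tag.replace(NS, "") for el in root.iter()]
print(flat)

# ... while walking the children recursively makes the hierarchy visible.
print("\n".join(outline(root)))
```

Running `outline()` on the real file should show quickly which element is the parent of each repeating group.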
> This structure repeats
> many times in the document with different values after each tag. I want
> to take the values and place them in one csv line for each structure in
> the file. The closest I have come is something like (but doesn't work):
>
> for element in root.iter("{urn:FindingImport}TOOL"):
>     print element.text
>     print element.getnext().text
>     print element.getnext().text
>
> The initial print element.tag and the first element.getnext().text work as
> I would like, but I am not finding a way to parse past that. The second
> element.getnext().text returns the value for the same tag as the one prior
> to it.
.getnext() returns the next sibling of the element, not its child, and calling
it twice on the same element returns the same sibling both times. I assume that
"TOOL" is the top-level element of the repeating subtree that you want to
extract here. In that case, you can use e.g.
    element.find("{urn:FindingImport}GD_VUL_NAME")
to retrieve the subelement named 'GD_VUL_NAME', or
    element.findtext("{urn:FindingImport}GD_VUL_NAME")
to retrieve its text content directly.
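Put together with the csv module, that could look roughly like the sketch below. It assumes, as above, that the fields are children of TOOL. The sample document, its root name IMPORT_FILE, the FIELDS selection and the `rows` helper are all illustrative, not part of the original data.

```python
import csv
import io

try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same API for this sketch
    import xml.etree.ElementTree as etree

NS = "{urn:FindingImport}"
FIELDS = ["FINDING_ID", "GD_VUL_NAME", "GD_SEVERITY"]  # columns to export

# Hypothetical sample; assumes the fields are children of TOOL.
SAMPLE = b"""<IMPORT_FILE xmlns="urn:FindingImport">
  <TOOL>
    <GD_VUL_NAME>Rename Built-in Guest Account</GD_VUL_NAME>
    <GD_SEVERITY>2</GD_SEVERITY>
    <FINDING_ID>V0001115</FINDING_ID>
  </TOOL>
  <TOOL>
    <GD_VUL_NAME>Rename Built-in Administrator Account</GD_VUL_NAME>
    <GD_SEVERITY>2</GD_SEVERITY>
    <FINDING_ID>V0001117</FINDING_ID>
  </TOOL>
</IMPORT_FILE>"""

def rows(root):
    """Yield one list of field values per TOOL subtree."""
    for tool in root.iter(NS + "TOOL"):
        # findtext() returns the default when the child is missing
        yield [tool.findtext(NS + name, default="") for name in FIELDS]

root = etree.fromstring(SAMPLE)

out = io.StringIO()  # stands in for a real output file
writer = csv.writer(out, lineterminator="\n")
writer.writerow(FIELDS)
writer.writerows(rows(root))
csv_text = out.getvalue()
print(csv_text)
```

Looking the fields up by name rather than by position keeps the output columns stable even if the order of the children varies between groups.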
You should also take a look at lxml.objectify, which provides a very handy
way to deal with the kind of XML that you have here. It will allow you to
do this:
    for tool in root.iter("{urn:FindingImport}TOOL"):
        print tool.GD_VUL_NAME, tool.FINDING
BTW, if all you want is to map the XML to CSV, without any major
restructuring in between, take a look at iterparse(). It works a lot like
the .iter() method, but iterates during parsing, which allows you to delete
subtrees after use to save memory.
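A minimal sketch of the iterparse() approach, again assuming that TOOL is the parent of each repeating group. The sample document, its root name IMPORT_FILE and the `iter_findings` helper are made up for illustration, and a BytesIO object stands in for the real file.

```python
import io

try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same API for this sketch
    import xml.etree.ElementTree as etree

NS = "{urn:FindingImport}"

# Hypothetical sample; assumes the fields are children of TOOL.
SAMPLE = b"""<IMPORT_FILE xmlns="urn:FindingImport">
  <TOOL>
    <GD_VUL_NAME>Rename Built-in Guest Account</GD_VUL_NAME>
    <FINDING_ID>V0001115</FINDING_ID>
  </TOOL>
  <TOOL>
    <GD_VUL_NAME>Rename Built-in Administrator Account</GD_VUL_NAME>
    <FINDING_ID>V0001117</FINDING_ID>
  </TOOL>
</IMPORT_FILE>"""

def iter_findings(source):
    """Yield (FINDING_ID, GD_VUL_NAME) pairs while the file is being parsed."""
    # An "end" event fires once an element and all its children are complete.
    for event, element in etree.iterparse(source, events=("end",)):
        if element.tag == NS + "TOOL":
            yield (element.findtext(NS + "FINDING_ID"),
                   element.findtext(NS + "GD_VUL_NAME"))
            element.clear()  # drop the finished subtree's children and text

for finding_id, name in iter_findings(io.BytesIO(SAMPLE)):
    print(finding_id, name)
```

The values are copied out as plain strings before clear() is called, so nothing in the yielded pairs depends on the discarded subtree.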
Stefan