[Tutor] removing nodes using ElementTree
Peter Otten
__peter__ at web.de
Mon Jun 29 10:58:26 CEST 2015
street.sweeper at mailworks.org wrote:
> Hello all,
>
> I'm trying to merge and filter some xml. This is working well, but I'm
> getting one node that's not in my list to include. Python version is
> 3.4.0.
>
> The goal is to merge multiple xml files and then write a new one based
> on whether or not <pid> is in an include list. In the mock data below,
> the 3 xml files have a total of 8 <rec> nodes, and I have 4 <pid> values
> in my list. The output is correctly formed xml, but it includes 5 <rec>
> nodes; the 4 in the list, plus 89012 from input1.xml. It runs without
> error. I've used type() to compare
> rec.find('part').find('pid').text and the items in the list, they're
> strings. When the first for loop is done, xmlet has 8 rec nodes. Is
> there a problem in the iteration in the second for? Any other
> recommendations also welcome. Thanks!
>
>
> The code itself was cobbled together from two sources,
> http://stackoverflow.com/questions/9004135/merge-multiple-xml-files-from-command-line/11315257#11315257
> and http://bryson3gps.wordpress.com/tag/elementtree/
>
> Here's the code and data:
>
> #!/usr/bin/env python3
>
> import os, glob
> from xml.etree import ElementTree as ET
>
> xmls = glob.glob('input*.xml')
> ilf = os.path.join(os.path.expanduser('~'),'include_list.txt')
> xo = os.path.join(os.path.expanduser('~'),'mergedSortedOutput.xml')
>
> il = [x.strip() for x in open(ilf)]
>
> xmlet = None
>
> for xml in xmls:
>     d = ET.parse(xml).getroot()
>     for rec in d.iter('inv'):
>         if xmlet is None:
>             xmlet = d
>         else:
>             xmlet.extend(rec)
>
> for rec in xmlet:
>     if rec.find('part').find('pid').text not in il:
>         xmlet.remove(rec)
>
> ET.ElementTree(xmlet).write(xo)
>
> quit()

I believe Alan answered your question; I just want to thank you for
taking the time to describe your problem clearly and for providing
all the necessary parts to reproduce it.

Bonus part:

Other options to filter a mutable sequence:

(1) assign to the slice:

    items[:] = [item for item in items if is_match(item)]

(2) iterate over it in reverse order:

    for item in reversed(items):
        if not is_match(item):
            items.remove(item)

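
To make the difference concrete, here is a small sketch with a plain list. The sample numbers and the is_match() predicate are made up for illustration; the first loop shows how removing items while iterating forward skips the element that follows each removal, and the other two are options (1) and (2).

```python
def is_match(item):
    # hypothetical predicate for this sketch: keep even numbers only
    return item % 2 == 0

# Forward iteration with remove(): after each removal the following
# element slides into the current position and is never inspected.
broken = [1, 3, 2, 5, 7, 4]
for item in broken:
    if not is_match(item):
        broken.remove(item)

# (1) slice assignment: build the filtered list first, then replace
# the contents of the existing list object in place.
sliced = [1, 3, 2, 5, 7, 4]
sliced[:] = [item for item in sliced if is_match(item)]

# (2) reverse iteration: removals only shift elements that have
# already been visited.
rev = [1, 3, 2, 5, 7, 4]
for item in reversed(rev):
    if not is_match(item):
        rev.remove(item)

print(broken)  # 3 and 7 survive although they do not match
print(sliced)
print(rev)
```

With this data the forward loop leaves [3, 2, 7, 4], while both alternatives give [2, 4].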
Below is a way to integrate method 1 in your code:

[...]
# set lookup is more efficient than lookup in a list
il = set(x.strip() for x in open(ilf))

def matching_recs(recs):
    return (rec for rec in recs if rec.find("part/pid").text in il)

xmlet = None
for xml in xmls:
    inv = ET.parse(xml).getroot()
    if xmlet is None:
        xmlet = inv
        # replace all recs with matching recs
        xmlet[:] = matching_recs(inv)
    else:
        # append only matching recs
        xmlet.extend(matching_recs(inv))

ET.ElementTree(xmlet).write(xo)
# the script will end happily without a quit() or exit() call
# quit()

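
If you want to try the approach without touching any files, the following is a self-contained sketch. The XML strings and the include set are invented stand-ins for input*.xml and include_list.txt; only the <inv>/<rec>/<part>/<pid> nesting is taken from your description. I let matching_recs() return a list here rather than a generator, which is a cautious choice on my part and not required by the code above.

```python
from xml.etree import ElementTree as ET

# Hypothetical stand-ins for the input files and the include list
docs = [
    "<inv>"
    "<rec><part><pid>12345</pid></part></rec>"
    "<rec><part><pid>89012</pid></part></rec>"
    "</inv>",
    "<inv>"
    "<rec><part><pid>23456</pid></part></rec>"
    "<rec><part><pid>34567</pid></part></rec>"
    "</inv>",
]
include = {"12345", "34567"}

def matching_recs(recs):
    # keep only the <rec> nodes whose part/pid text is in the include set
    return [rec for rec in recs if rec.find("part/pid").text in include]

merged = None
for doc in docs:
    inv = ET.fromstring(doc)
    if merged is None:
        # reuse the first root and replace its children with the matches
        merged = inv
        merged[:] = matching_recs(inv)
    else:
        # later documents only contribute their matching children
        merged.extend(matching_recs(inv))

pids = [rec.find("part/pid").text for rec in merged]
print(pids)
print(ET.tostring(merged, encoding="unicode"))
```

Only the two pids from the include set should end up in the merged tree; 89012 is filtered out before it is ever added.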
At least with your sample data

>     for rec in d.iter('inv'):

iterates over a single node (the root), so I omitted that loop.
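
You can check this in the interactive interpreter. The sketch below uses two made-up documents with the same <inv> root as your data; it shows that iter('inv') yields just the root, and that passing the root to extend() appends its children, because iterating over an Element yields its child nodes.

```python
from xml.etree import ElementTree as ET

# Made-up documents shaped like the <inv>/<rec> data described above
first = ET.fromstring("<inv><rec><part><pid>12345</pid></part></rec></inv>")
second = ET.fromstring(
    "<inv>"
    "<rec><part><pid>23456</pid></part></rec>"
    "<rec><part><pid>34567</pid></part></rec>"
    "</inv>"
)

# iter() starts with the element it is called on, so with a single
# <inv> element in the document the root is the only hit
hits = list(second.iter("inv"))
root_is_only_hit = len(hits) == 1 and hits[0] is second

# extend() accepts a sequence of elements; an Element behaves like a
# sequence of its children, so this appends both <rec> nodes of `second`
first.extend(second)
merged_pids = [rec.find("part/pid").text for rec in first]

print(root_is_only_hit)
print(merged_pids)
```

So the inner loop runs exactly once per file here; it would only matter if an <inv> element were nested further down in the document.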