[Tutor] removing nodes using ElementTree

Mon Jun 29 10:58:26 CEST 2015

street.sweeper at mailworks.org wrote:

> Hello all,
> 
> I'm trying to merge and filter some xml.  This is working well, but I'm
> getting one node that's not in my list to include.  Python version is
> 3.4.0.
> 
> The goal is to merge multiple xml files and then write a new one based
> on whether or not <pid> is in an include list.  In the mock data below,
> the 3 xml files have a total of 8 <rec> nodes, and I have 4 <pid> values
> in my list.  The output is correctly formed xml, but it includes 5 <rec>
> nodes; the 4 in the list, plus 89012 from input1.xml.  It runs without
> error.  I've used type() to compare
> rec.find('part').find('pid').text and the items in the list, they're
> strings.  When the first for loop is done, xmlet has 8 rec nodes.  Is
> there a problem in the iteration in the second for?  Any other
> recommendations also welcome.  Thanks!
> 
> 
> The code itself was cobbled together from two sources,
> http://stackoverflow.com/questions/9004135/merge-multiple-xml-files-from-command-line/11315257#11315257
> and http://bryson3gps.wordpress.com/tag/elementtree/
> 
> Here's the code and data:
> 
> #!/usr/bin/env python3
> 
> import os, glob
> from xml.etree import ElementTree as ET
> 
> xmls = glob.glob('input*.xml')
> ilf = os.path.join(os.path.expanduser('~'),'include_list.txt')
> xo = os.path.join(os.path.expanduser('~'),'mergedSortedOutput.xml')
> 
> il = [x.strip() for x in open(ilf)]
> 
> xmlet = None
> 
> for xml in xmls:
>     d = ET.parse(xml).getroot()
>     for rec in d.iter('inv'):
>         if xmlet is None:
>             xmlet = d
>         else:
>             xmlet.extend(rec)
> 
> for rec in xmlet:
>     if rec.find('part').find('pid').text not in il:
>         xmlet.remove(rec)
> 
> ET.ElementTree(xmlet).write(xo)
> 
> quit()

I believe Alan answered your question; I just want to thank you for 
taking the time to describe your problem clearly and for providing 
all the necessary parts to reproduce it.

Bonus part:

Other options to filter a mutable sequence:

(1) assign to the slice:

items[:] = [item for item in items if is_match(item)]

(2) iterate over it in reverse order:

for item in reversed(items):
    if not is_match(item):
        items.remove(item)
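
To see why these two options behave differently from removing items in a plain forward loop, here is a minimal sketch on an ordinary list. The is_match() predicate and the numbers are made up for illustration; they are not taken from your data.

```python
def is_match(item):
    # Hypothetical predicate for illustration only: keep even numbers.
    return item % 2 == 0

# Removing while iterating forward: each removal shifts the remaining
# items one position to the left, so the item that directly follows a
# removed one is never visited.
buggy = [1, 3, 5, 6, 8]
for item in buggy:
    if not is_match(item):
        buggy.remove(item)

# (1) Slice assignment: build the filtered list first, then replace the
# contents of the same list object in one step.
sliced = [1, 3, 5, 6, 8]
sliced[:] = [item for item in sliced if is_match(item)]

# (2) Reverse iteration: a removal only shifts items that have already
# been visited, so nothing is skipped.
rev = [1, 3, 5, 6, 8]
for item in reversed(rev):
    if not is_match(item):
        rev.remove(item)

print(buggy)   # the 3 survives although it does not match
print(sliced)
print(rev)
```

One caveat with option 2: remove() deletes the first item that compares equal, so with duplicate values it may delete a different occurrence than the one just visited. That does not matter when every unwanted item has to go anyway.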

Below is a way to integrate method 1 in your code:

[...]

# set lookup is more efficient than lookup in a list
with open(ilf) as f:
    il = set(x.strip() for x in f)

def matching_recs(recs):
    return (rec for rec in recs if rec.find("part/pid").text in il)

xmlet = None
for xml in xmls:
    inv = ET.parse(xml).getroot()
    if xmlet is None:
        xmlet = inv
        # replace all recs with matching recs
        xmlet[:] = matching_recs(inv)
    else:
        # append only matching recs
        xmlet.extend(matching_recs(inv))

ET.ElementTree(xmlet).write(xo)

# the script will end happily without a quit() or exit() call
# quit()

At least with your sample data

> for rec in d.iter('inv'):

iterates over a single node (the root) so I omitted that loop.
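
If you want to try the approach without touching any files, the following self-contained sketch runs the same merge-and-filter logic on in-memory XML. The documents, the pid values and the names docs, include and merged are invented for this example; I am only assuming that each file has an <inv> root holding <rec> nodes with a part/pid child, as your code suggests.

```python
from xml.etree import ElementTree as ET

# Hypothetical stand-ins for the input*.xml files and the include list.
docs = [
    "<inv>"
    "<rec><part><pid>1</pid></part></rec>"
    "<rec><part><pid>2</pid></part></rec>"
    "<rec><part><pid>3</pid></part></rec>"
    "</inv>",
    "<inv>"
    "<rec><part><pid>4</pid></part></rec>"
    "<rec><part><pid>5</pid></part></rec>"
    "</inv>",
]
include = {"1", "4", "5"}

def matching_recs(recs):
    # findtext() returns None when part/pid is missing, so a malformed
    # rec is simply filtered out instead of raising AttributeError.
    return [rec for rec in recs if rec.findtext("part/pid") in include]

merged = None
for doc in docs:
    inv = ET.fromstring(doc)
    if merged is None:
        # The first document provides the root; keep only matching recs.
        merged = inv
        merged[:] = matching_recs(inv)
    else:
        # Later documents contribute only their matching recs.
        merged.extend(matching_recs(inv))

pids = [rec.findtext("part/pid") for rec in merged]
print(pids)
print(ET.tostring(merged, encoding="unicode"))
```

Here matching_recs() returns a list rather than a generator so that the filtering is finished before the root's children are replaced.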