[Tutor] xml parsing from xml

Sat May 10 11:52:10 CEST 2014

Stefan Behnel, 10.05.2014 10:57:
> Danny Yoo, 07.05.2014 22:39:
>> If you don't want to deal with an event-driven approach that SAX
>> emphasizes, you may still be able to do this problem with an XML-Pull
>> parser.  You mention that your input is hundreds of megabytes long, in
>> which case you probably really do need to be careful about memory
>> consumption.  See:
>>
>>     https://wiki.python.org/moin/PullDom
> 
> Since the OP mentioned that the file is quite large (800 MB), not only
> memory consumption should matter but also processing time. If that is the
> case, PullDOM isn't something to recommend since it's MiniDOM based, which
> makes it quite slow overall.

To back that up with some numbers, here are three memory-efficient
implementations, using PullDOM, cElementTree and lxml.etree:

$ cat lx.py
from lxml.etree import iterparse, tostring

doc = iterparse('input.xml', tag='country')
root = None
for _, node in doc:
    print("--------------")
    print("This is the node for " + node.get('name'))
    print("--------------")
    print(tostring(node))
    print("\n\n")

    if root is None:
        root = node.getparent()
    else:
        sib = node.getprevious()
        if sib is not None:
            root.remove(sib)

$ cat et.py
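# Note: cElementTree was removed in Python 3.9; on 3.3 and later,
# xml.etree.ElementTree uses the same C accelerator automatically.
from xml.etree.cElementTree import iterparse, tostring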

doc = iterparse('input.xml')
for _, node in doc:
    if node.tag == "country":
        print("--------------")
        print("This is the node for " + node.get('name'))
        print("--------------")
        print(tostring(node))
        print("\n\n")
        node.clear()

$ cat pdom.py
from xml.dom.pulldom import START_ELEMENT, parse

doc = parse('input.xml')
for event, node in doc:
    if event == START_ELEMENT and node.localName == "country":
        doc.expandNode(node)
        print("--------------")
        print("This is the node for " + node.getAttribute('name'))
        print("--------------")
        print(node.toxml())
        print("\n\n")

I ran all three against a 400 MB XML file generated by repeating the data
snippet the OP provided. Here are the system clock timings in
minutes:seconds, on 64-bit Linux, using CPython 3.4.0:

$ time python3 lx.py > /dev/null
time: 0:31
$ time python3 et.py > /dev/null
time: 3:33
$ time python3 pdom.py > /dev/null
time: 9:51
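
For anyone who wants to reproduce this kind of measurement, here is a
minimal sketch of how a large test file could be built by repeating one
record. The SNIPPET content, the <data> root element and the
write_test_file() helper are assumptions made for illustration; they are
not the OP's actual data or the script used for the timings above.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical record; the OP's real data snippet is not reproduced here.
SNIPPET = b'<country name="Example"><rank>1</rank></country>\n'


def write_test_file(path, target_bytes, snippet=SNIPPET):
    """Write a well-formed XML file of at least target_bytes of records.

    Returns the number of times the snippet was written.
    """
    written = 0
    count = 0
    with open(path, "wb") as f:
        f.write(b"<data>\n")  # assumed root element
        while written < target_bytes:
            f.write(snippet)
            written += len(snippet)
            count += 1
        f.write(b"</data>\n")
    return count


if __name__ == "__main__":
    # Small demo size; pass e.g. 400 * 1024 * 1024 for a benchmark-sized file.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "input.xml")
        n = write_test_file(path, 10000)
        # Sanity check: the file parses and holds one element per repetition.
        assert len(ET.parse(path).getroot()) == n
        print(n, os.path.getsize(path))
```

Writing in binary mode keeps the byte count exact, so the target size means
what it says regardless of platform line endings.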

Scale the timings above up to the OP's 800 MB and add a bit of actual tree
processing on top: if I had to choose between about 2 minutes and well over
20 minutes of processing time, I'd tend to prefer the 2 minutes.

Note that the reason why cElementTree performs so poorly here is that its
serialiser is fairly slow, and the code writes the entire 400 MB of XML
back out. If the test were more like "parse 400 MB and generate CSV from
it", then it should perform similarly to lxml. PullDOM/MiniDOM, on the
other hand, are slow on parsing, serialisation *and* tree processing.
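
As a rough illustration of that "parse and generate CSV" variant, here is a
minimal sketch using the standard library's ElementTree. It was not part of
the timings above. The countries_to_csv() helper and the two output columns
(the 'name' attribute and the number of child elements) are assumptions
chosen for the example, since the OP's real fields are not shown here.

```python
import csv
import io
from xml.etree.ElementTree import iterparse


def countries_to_csv(xml_source, csv_out):
    """Stream <country> elements and write one CSV row per element.

    Returns the number of data rows written.
    """
    writer = csv.writer(csv_out)
    writer.writerow(["name", "children"])
    rows = 0
    for _, node in iterparse(xml_source):  # default: "end" events only
        if node.tag == "country":
            # Hypothetical columns: 'name' attribute and child element count.
            writer.writerow([node.get("name"), len(node)])
            rows += 1
            node.clear()  # drop the element's content to limit memory use
    return rows


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<data>"
        b"<country name='A'><rank>1</rank></country>"
        b"<country name='B'><rank>2</rank><year>2011</year></country>"
        b"</data>"
    )
    out = io.StringIO()
    countries_to_csv(sample, out)
    print(out.getvalue())
```

No serialiser is involved here: only attribute lookups and the csv module
touch the data, which is why the parser's speed dominates in this kind of
task.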

Stefan