python xml DOM? pulldom? SAX?
alanmk at hotmail.com
Mon Aug 29 20:58:09 CEST 2005
>>SAX is perfect for the job. See code below.
> depends on your definition of perfect...
Obviously, perfect is the eye of the beholder ;-)
> using a 20 MB version of jog's sample, and having replaced
> the print statements with local variable assignments, I get the
> following timings:
> 5 lines of cElementTree code: 7.2 seconds
> 60+ lines of xml.sax code: 63 seconds
> (Python 2.4.1, Windows XP, Pentium 3 GHz)
At first, I thought your code sample was building a tree for the entire
document, so I checked the API docs. It appeared to me that an event
processing model *couldn't* obtain the text for the node when notified
of the node: the text node is still in the future.
That's when I understood the nature of iterparse, which must generate an
event *after* the node is complete, and it's subdocument reified. That's
also when I understood the meaning of the "elem.clear()" call at the
end. Only the required section of the tree is modelled in memory at any
given time. Nice.
There are some minor inefficiencies in my pure python sax code, e.g.
building the stack expression for every evaluation, but I left them in
for didactic reasons. But even if every possible speed optimisation was
added to my python code, I doubt it would be able to match your code.
I'm guessing that a lot of the reason why cElementTree performs best is
because the model-building is primarily implemented in C: Both of our
solutions run python code for every node in the tree, i.e. are O(N). But
yours also avoids the overhead of having function-calls/stack-frames for
every single node event, by processing all events inside a single function.
If the SAX algorithm were implemented in C (or Java) for that matter, I
wonder if it might give comparable performance to the cElementTree code,
primarily because the data structures it is building are simpler,
compared to the tree-subsections being reified and discarded by
cElementTree. But that's not of relevance, because we're looking for
python solutions. (Aside: I can't wait to run my solution on a
fully-optimising PyPy :-)
That's another nice thing I didn't know (c)ElementTree could do.
email alan: http://xhaus.com/contact/alan
More information about the Python-list