Why is xml.dom.minidom so slow?
Martin v. Löwis
martin at v.loewis.de
Fri Jan 3 00:01:10 CET 2003
"Bjorn Pettersen" <BPettersen at NAREX.com> writes:
> If I'm reading the minidom/pulldom files correctly this should use
Yes, that is the only possible interpretation if no other parsers are
installed.
> As a test, I tried building my own tree directly from the Expat
> events. This was about 4 times faster (2.89 accts/sec), but still
> far from fast enough... I'm starting to think a custom C++ parser
> might be the way to go (and here I was having such a nice day
I see. Then I would suggest that the mere parsing speed is not the
issue - this uses roughly all tricks we can think of. It still would
be interesting to find out where the computation time is spent. If
these are complicated documents (i.e. many elements and attributes,
short PCDATA), then surely memory allocation is an issue - you could
try Python 2.3a1 also, as a test (pymalloc should give some
improvements when there are many memory allocations).
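For reference, here is a minimal sketch of what building a tree
directly from Expat events might look like (the Node class and helper
names here are my own, not from the original poster's code):

```python
import xml.parsers.expat

class Node:
    # __slots__ avoids a per-instance __dict__, cutting allocations
    __slots__ = ("name", "attrs", "children", "text")
    def __init__(self, name, attrs):
        self.name, self.attrs = name, attrs
        self.children, self.text = [], []

def build_tree(data):
    root = Node("#document", {})
    stack = [root]
    p = xml.parsers.expat.ParserCreate()
    def start(name, attrs):
        node = Node(name, attrs)
        stack[-1].children.append(node)
        stack.append(node)
    def end(name):
        stack.pop()
    def chars(text):
        # expat may deliver character data in several chunks
        stack[-1].text.append(text)
    p.StartElementHandler = start
    p.EndElementHandler = end
    p.CharacterDataHandler = chars
    p.Parse(data, True)
    return root

doc = build_tree("<a x='1'><b>hi</b></a>")
```

Even this skips most of what a DOM builder does (no node classes per
type, no parent links, no namespace handling), which is where the
speed difference comes from.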
I doubt that a custom parser can do much better, unless it allows you
to drop data you are not interested in.
What *has* been demonstrated to be a speed-up over minidom is to use
4Suite's cDomlette. It is faster, because:
- it allocates fewer objects: many things are stored in the elements
  themselves, instead of in dictionaries, as Python classic classes do
- object creation is through C, with no need to lookup Python methods
over and over again.
When completed, it still gives you a Python-conforming DOM tree. That
DOM tree misses some of the DOM functionality, though, that's why they
call it a Domlette.
> :-) Unfortunately they're not my requirements. (They go something
> :like: "we will eventually need all the data, so put them in a form
> :that the next step can traverse to put into a DB".) If you think a
> :different approach is better I'm all ears :-)
The stream-processing approaches are *much* faster, in all
languages. They don't create intermediate objects, but present you
with just the strings that the parser had to extract from the
document.
In order of increasing speed, decreasing standards conformance:
- SAX: depending on how you design the content handler, you can be
much faster than a DOM builder already. As a test, you might want to
plug in an empty ContentHandler, and see how many documents you
can parse without processing in a certain time.
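Such a throughput test might look like this (the timing helper is my
own sketch; an empty ContentHandler inherits no-op methods for all
events, so only raw parsing cost is measured):

```python
import time
import xml.sax

class Empty(xml.sax.ContentHandler):
    pass  # all handler methods are inherited no-ops

def parse_rate(data, seconds=1.0):
    # parse the same document repeatedly; return documents per run
    n = 0
    deadline = time.time() + seconds
    while time.time() < deadline:
        xml.sax.parseString(data, Empty())
        n += 1
    return n
```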
- Expat raw interface: parsing is XML-conforming, but the API of
  Expat is proprietary. This saves indirections, and is again faster.
You can apply the same benchmark with little effort.
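The same benchmark against the raw Expat interface could be sketched
as (again, the helper name is mine; with no handlers registered, Expat
only has to tokenize):

```python
import time
import xml.parsers.expat

def expat_rate(data, seconds=1.0):
    # count how many documents raw Expat parses per run
    n = 0
    deadline = time.time() + seconds
    while time.time() < deadline:
        p = xml.parsers.expat.ParserCreate()  # a parser is single-use
        p.Parse(data, True)
        n += 1
    return n
```

Comparing this number against the SAX figure shows what the SAX
driver layer itself costs.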
- PyXML's sgmlop: to my knowledge, the fastest for-Python XML
parser, but it misses a number of XML features (e.g. it won't
do entity expansion).
In any case, please report what your findings are and what technology
you eventually use.