[XML-SIG] minidom changes & additional modules

Fred L. Drake, Jr. fdrake@acm.org
Fri, 15 Feb 2002 00:07:15 -0500

Please send comments on this discussion & proposal to xml-sig@python.org.

minidom changes & additional modules

At Python 10, the problem of performance for the Python DOM
implementations came up (again).  There are several implementations of
the DOM either already available or coming available in the near
future (a few are in alpha/beta releases).  I'm not sure what the
status of those will be, or how they'll perform, but many will include
C code rather than being 100% pure Python implementations.  I think it
would be nice to have a pure Python DOM available as part of both the
PyXML package and the standard library.

At the moment, that means minidom.  While not perfect, there just
haven't been a lot of complaints about bugs, so I'm inclined to
consider it fairly stable code.  It will be easier to get a new
revision of minidom into the standard library than it would be to
replace it, specifically because of the familiarity and stability of
the code.

The performance problem is substantial for many users, however, and
the most often cited problems seem to be the time it takes to
construct a new document object from XML text, and the memory size of
the document.  While it won't be possible to solve all the performance
and resource problems by tweaking the implementation, there are a
number of areas in which improvements are not difficult to achieve and
won't cause significant de-stabilization concerns.

I'd like to describe my post-conference efforts and see if others
think this is a reasonable approach.

Optimizing minidom

As its name suggests, minidom is a fairly minimal implementation of
the DOM; there are many corner cases it doesn't check, and it
certainly doesn't enforce all the read-only constraints specified in
the DOM recommendations.  As such, it is wonderfully "Pythonic", and
subscribes whole-heartedly to the "we're all adults here" philosophy.

However, the minidom implementation does use the constructors of the
various node types to initialize based on whatever parameters are
passed in (mostly), even though the class constructors are not
specified by the DOM (only the factory functions are specified).
Moreover, the constructors generally cascade to the base
Node.__init__() constructor, which does very little, but does provide
one debugging hook.  This hook, which causes all nodes to be added to
a table when debugging is enabled, is used to make sure that all nodes
are freed when the Node.unlink() method is called; the regression test
checks that the table is empty after the appropriate unlink() calls
have been made.

Unfortunately, this imposes a serious performance burden.  Even nodes
which do not allow child nodes (such as Text and Comment nodes) end up
with a new NodeList instance (essentially a built-in list), and the
__init__() must be called.  If we remove the Node.__init__() method,
we do lose some measure of debug-ability, but we can relegate NodeList
creation to the constructors of the derived classes and avoid the
method call to the base constructor.  In the case of the Text &
CDATASection node types, we can do the attribute initialization in the
factory functions and avoid having an __init__() method altogether,
allowing these nodes to be much cheaper to construct.

Constructing document instances

The current process of constructing a typical minidom document object
using the minidom.parse() and minidom.parseString() functions is
fairly expensive at runtime.  There's a pyexpat parser, an adapter
that makes it look like a SAX2 reader, a ContentHandler and an event
manager from the pulldom module, plus the calls into the minidom code
itself.  A large portion of the code is simply the adaptors that glue
everything together.

Building a minidom DOM builder directly on top of pyexpat allows us to
avoid about a third of the overhead, and such an implementation can be
based on some code I wrote for the "Parsed XML" product for Zope.
While the code was originally targeted to a DOM written for that
project, most of it is adapting the pyexpat callback parameters to the
DOM structures; modifying it to work with minidom has proved fairly

The minidom.parse() and minidom.parseString() functions can be easily
adapted to use the alternate DOM builder if a specific parser is not
indicated in the parameter list.

As a benefit, the new builder offers an interface much like what is
being proposed for the DOM Level 3 Load specification, including
filtering support.  This makes it easy to drop information that isn't
needed while supporting more of the DOM than the pulldom methods.

The target API for the new interfaces will be the final DOM 3 Load
component, but I don't know when that can be expected to become
final.  The working group is making progress, but it doesn't appear to
be close to completion.  I expect to provide feedback based on
implementation experience.

Current status

I have code that works and supports much of what's proposed in each of
the sections above.  My test data indicates that the speed of building
the DOM has approximately doubled.  The minidom changes alone don't
make a lot of difference, but combined with the new loader we see a
great deal of improvement.

I'd like to check this into the PyXML tree soon so that we can shake
out any problems and maybe squeeze out more performance before
integrating the changes into the standard library.  Before I do this,
I'd like to add Entity and Notation nodes to minidom, since the
builder can extract the required information from Expat.  (But that's
just a few hours' work; the basic implementation for that was already
done for the Parsed XML project.)

Further work

There's a lot more that could be done to improve performance,
primarily by finishing some work on Expat and extending the pyexpat
interface to support reporting the details of each event in an
easier-to-use form, but I'll leave that for another day.  ;-)


Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at Zope Corporation