[XML-SIG] minidom changes & additional modules

Uche Ogbuji uche.ogbuji@fourthought.com
Fri, 15 Feb 2002 08:59:38 -0700

> At the moment, that means minidom.  While not perfect, there just
> haven't been a lot of complaints about bugs, so I'm inclined to
> consider it fairly stable code.  It will be easier to get a new
> revision of minidom into the standard library than it would be to
> replace it, specifically because of the familiarity and stability of
> the code.

Sure.  Mike Olson is also working to make minidom amenable enough for pDomlette 
to be eliminated.  The assumption is that this can be done without affecting 
minidom performance.

> However, the minidom implementation does use the constructors of the
> various node types to initialize based on whatever parameters are
> passed in (mostly), even though the class constructors are not
> specified by the DOM (only the factory functions are specified).
> Moreover, the constructors generally cascade to the base
> Node.__init__() constructor, which does very little, but does provide
> one debugging hook.  This hook, which causes all nodes to be added to
> a table when debugging is enabled, is used to make sure that all nodes
> are freed when the Node.unlink() method is called; the regression test
> checks that the table is empty after the appropriate unlink() calls
> have been made.
> Unfortunately, this imposes a serious performance burden.  Even nodes
> which do not allow child nodes (such as Text and Comment nodes) end up
> with a new NodeList instance (essentially a built-in list), and the
> __init__() must be called.  If we remove the Node.__init__() method,
> we do lose some measure of debug-ability, but we can relegate NodeList
> creation to the constructors of the derived classes and avoid the
> method call to the base constructor.  In the case of the Text &
> CDATASection node types, we can do the attribute initialization in the
> factory functions and avoid having an __init__() method altogether,
> allowing these nodes to be much cheaper to construct.

Sounds fine to me.
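
A minimal sketch of the idea, assuming simplified stand-in classes rather 
than the actual minidom code (the class layout and attribute defaults here 
are hypothetical): leaf nodes get no __init__() and no per-instance child 
list, and the factory function sets their attributes directly.

```python
class Node:
    # Hypothetical base class: deliberately has no __init__(), so creating
    # a subclass instance does not cascade into a base constructor call.
    parentNode = None
    childNodes = ()  # shared, immutable default for nodes without children


class Text(Node):
    # No __init__() here either; the factory function fills in the data.
    nodeType = 3  # TEXT_NODE in the DOM
    nodeName = "#text"


class Element(Node):
    nodeType = 1  # ELEMENT_NODE in the DOM

    def __init__(self, tagName):
        self.tagName = tagName
        # Only node types that allow children pay for a list of their own.
        self.childNodes = []


class Document(Node):
    nodeType = 9  # DOCUMENT_NODE in the DOM

    def __init__(self):
        self.childNodes = []

    def createTextNode(self, data):
        # Attribute initialization happens in the factory, not a constructor.
        node = Text()
        node.data = data
        node.ownerDocument = self
        return node

    def createElement(self, tagName):
        node = Element(tagName)
        node.ownerDocument = self
        return node


doc = Document()
text = doc.createTextNode("hello")
elem = doc.createElement("p")
```

The trade-off is the one described in the quoted text: with no shared base 
constructor there is no single place to hang a debugging hook.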

> Building a minidom DOM builder directly on top of pyexpat allows us to
> avoid about a third of the overhead, and such an implementation can be
> based on some code I wrote for the "Parsed XML" product for Zope.
> While the code was originally targeted to a DOM written for that
> project, most of it is adapting the pyexpat callback parameters to the
> DOM structures; modifying it to work with minidom has proved fairly
> easy.
> The minidom.parse() and minidom.parseString() functions can be easily
> adapted to use the alternate DOM builder if a specific parser is not
> indicated in the parameter list.
> As a benefit, the new builder offers an interface much like what is
> being proposed for the DOM Level 3 Load specification, including
> filtering support.  This makes it easy to drop information that isn't
> needed while supporting more of the DOM than the pulldom methods.
> The target API for the new interfaces will be the final DOM 3 Load
> component, but I don't know when that can be expected to become
> final.  The working group is making progress, but it doesn't appear to
> be close to completion.  I expect to provide feedback based on
> implementation experience.

This also seems fine to me.  I would, however, like to merge the DOM L&S 
InputSource class with the SAX2 InputSource class, so we can have some 
uniformity and simplicity in URI and entity resolution and handling across 
PyXML (and 4Suite).

This might involve some changes to our use of DOM L&S, which wouldn't bother 
me a bit.
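
To illustrate the kind of uniformity I mean, here is a rough sketch that 
feeds a SAX2 InputSource to minidom.  It is only an assumption about how a 
shared InputSource might be consumed, not a proposed API: the helper name 
parse_input_source and the example system ID are made up, and the sketch 
only handles sources that already carry a byte stream.

```python
import io
from xml.dom import minidom
from xml.sax.xmlreader import InputSource


def parse_input_source(source):
    """Hypothetical helper: build a minidom tree from a SAX2 InputSource."""
    stream = source.getByteStream()
    if stream is None:
        # Resolving the system ID to a stream is left out of this sketch.
        raise ValueError("this sketch only handles sources with a byte stream")
    return minidom.parse(stream)


# The same InputSource object carries both the identifier and the data.
source = InputSource("urn:example:greeting")  # made-up system ID
source.setByteStream(io.BytesIO(b"<greeting>hello</greeting>"))

doc = parse_input_source(source)
root_name = doc.documentElement.tagName
system_id = source.getSystemId()
```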

I hope to put my proposal on the matter forth today.

I guess Tom Passin had a point with the PyXML PEP idea.  :-)

> I have code that works and supports much of what's proposed in each of
> the sections above.  My test data indicates that the speed of building
> the DOM has approximately doubled.  The minidom changes alone don't
> make a lot of difference, but combined with the new loader we see a
> great deal of improvement.
> I'd like to check this into the PyXML tree soon so that we can shake
> out any problems and maybe squeeze out more performance before
> integrating the changes into the standard library.  Before I do this,
> I'd like to add Entity and Notation nodes to minidom, since the
> builder can extract the required information from Expat.  (But that's
> just a few hours' work; the basic implementation for that was already
> done for the Parsed XML project.)

OK.  But even though I agree with most of it, I think you should check this 
into a branch.

Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Boulder, CO 80301-2537, USA
XML strategy, XML tools (http://4Suite.org), knowledge management