[XML-SIG] DOM Considered Harmful :-)

Sat, 24 Apr 1999 03:21:02 -0700

All right... I've been slow to respond, and my replies have been
minimal, because I was out this past week (with only limited access).
I'm leaving for Mexico in about three hours, and I'll have zero access
for a week. Of course,
this means that I have the privilege of posting something highly
controversial with the hope that an argument will continue for the next
seven days and I can rejoin it at that time :-)

Okay... seriously, though, I'd like to state my opposition to a DOM, a
subset, or a DOM-like API for a "lightweight" XML parsing solution.

Here are my assumptions/requirements/etc:

1) lightweight means:
   a) fast as possible
   b) conceptually simple for the user
   c) narrow interface (somewhat related to (b))
2) 1b and 1c imply simple documentation, so a non-DOM interface is not
a hurdle
3) this API is only for consuming XML
4) it is fine to "fall back" to the DOM if the lightweight API doesn't
meet a client's needs
   a) corollary: the ability to swap in alternative parsers is not
required
   b) corollary: ORB compat is not required
   c) corollary: stylistic compatibility (with other languages' XML
libraries) is not required

A couple items have been discussed on the list which I'd like to call
out and respond to:

1) the DOM concept of node types

IMO, this is one of the most broken things about the DOM. The child
nodes end up being some random mixture of various node types. Any
client trying to deal with this must *test* each node before they use it
to see if they're looking at the right thing. This is very troublesome.
As a real-world example, when I coded davlib against the DOM and I
needed the first (only) child element of my <propfind> element, there
was no easy and evident approach to this. I knew that child element
would be a <prop>, but what happened was that Text nodes were mixed in.
"oh, well do a findByTagName" or whatever. That wouldn't help on the
next case, where I needed each of the child elements for the <prop>.
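
A minimal sketch of the problem, using xml.dom.minidom from today's
Python standard library as a stand-in for the DOM implementation of the
time. The request body is an illustrative assumption, not the actual
davlib traffic:

```python
from xml.dom import minidom

BODY = """<propfind xmlns="DAV:">
  <prop>
    <getcontentlength/>
    <getlastmodified/>
  </prop>
</propfind>"""

doc = minidom.parseString(BODY)
propfind = doc.documentElement

# The child list mixes whitespace Text nodes with the one Element we want.
kinds = [child.nodeType for child in propfind.childNodes]

# Every consumer ends up writing a filter like this by hand.
elements = [c for c in propfind.childNodes if c.nodeType == c.ELEMENT_NODE]
prop = elements[0]

# The same filter is needed again one level down.
prop_children = [c for c in prop.childNodes if c.nodeType == c.ELEMENT_NODE]
```

Here propfind.childNodes[0] is a Text node holding indentation, so the
"first child" is not the <prop> element the client is after.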

Also, look at that answer: "use findByTagName" ... that is simply a
mechanism to get around the fact that the DOM has introduced a
hard-to-use structure. Paul recently followed up to his original
proposal with another proposal to add new methods to his element
objects. Specifically, the getChild() method -- again, this was
introduced *solely* due to the fact that the DOM has a heterogeneous
list of children. The client must apply various filters and other
processing to get useful information. The system must apply tests "is
this the right node type?" here and there.

In one of Paul's original responses to my post, he listed "convenience"
as part of the definition of "lightweight". It sure is, but his response
to making a DOM subset convenient was to introduce helper functions.

I think this is quite broken. As a comparison, the qp_xml module returns
an element that has *only* elements for children. There is no filtering
or other things to get past. The list items are *known*. The text is
stored outside of that list so that you don't have to manually separate
the two all the time. Essentially, qp_xml is easy/convenient
*inherently* rather than patched-up via convenience functions.

In summary, I maintain that any DOM-style system is not inherently
simple or easy to use because of its heterogeneous node lists. I further
believe that something like qp_xml is much nicer all around because its
simplicity/ease/etc originates right from the bottom, rather than being
hidden behind a second layer of API.

Disclaimer: qp_xml does have a convenience function (the textof()
function, which could/should be a method instead). The existence of the
function is based solely on the underlying representation of text
contents, where that design was chosen to be able to retain the document
structure (insofar as elems/text are retained).
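
The shape described above can be sketched roughly as follows. This is
not qp_xml's code: the Elem class, its field names (first_text,
following_text), and this textof() are hypothetical, written only to
illustrate "children are elements, text lives beside them" on top of
the standard xml.parsers.expat module:

```python
from xml.parsers import expat

class Elem:
    """Plain attribute holder: no parent pointer, no navigation methods."""
    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs          # ordinary dict of attributes
        self.children = []          # child *elements* only, never text
        self.first_text = ""        # text before the first child element
        self.following_text = ""    # text right after this element's end tag

def parse(xml_text):
    stack, roots = [], []

    def start(name, attrs):
        elem = Elem(name, attrs)
        if stack:
            stack[-1].children.append(elem)
        else:
            roots.append(elem)
        stack.append(elem)

    def end(name):
        stack.pop()

    def chars(data):
        if not stack:
            return
        cur = stack[-1]
        if cur.children:
            # Text after a closed child is attached to that child.
            cur.children[-1].following_text += data
        else:
            cur.first_text += data

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return roots[0]

def textof(elem):
    """Join an element's own text, skipping text inside child elements."""
    return elem.first_text + "".join(c.following_text for c in elem.children)

root = parse("<a>one<b>inner</b>two<c/>three</a>")
```

With this layout root.children holds only the <b> and <c> elements, and
the one helper exists purely to stitch the separated text back together.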

2) the close() method and parent/sibling relationships

Adding parent/sibling relationships introduces loops unless you use
proxies or introduce a close() method (if there is another way, then I'd
like to learn it). Proxies are out for efficiency reasons -- objects get
constructed every time you simply want to peek into the data structure.
While the complexity is (mostly) hidden from the client, it is still
there. You don't end up with simple data structures... instead, you get
a lot of "mechanism" in there to deal with intercepting accesses so that
you can create a proxy to bundle up the necessary data.

A close() type method introduces other problems. If you aren't careful,
then it is easy to leak the entire parse tree. What happens if you pass
a subset of the tree to another subsystem? You will have one of two
problems: 1) the client avoids calling close() so the subsystem can use
parent references (this leaks the whole tree); or 2) the client calls
close() so the subsystem only retains its subtree, but now its
(expected) parent/sibling relationships no longer work. It has a set of
objects that don't fully respond to their published API.
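
A minimal sketch of the second failure mode, assuming a hypothetical
Node class with a parent back-reference and a close() that clears it
(neither is taken from any code posted to the list). The back-reference
forms a cycle with the parent's child list, which pure reference
counting cannot free on its own:

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # back-reference: cycle with parent.children
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def close(self):
        """Break the parent links so the tree no longer contains cycles."""
        for child in self.children:
            child.close()
        self.parent = None

root = Node("multistatus")
subtree = Node("response", root)
Node("href", subtree)

# A subsystem holding `subtree` can navigate upward at this point...
before = subtree.parent.name

# ...but once the owner cleans up, the published API quietly stops working.
root.close()
after = subtree.parent
```

The subsystem still holds live objects, yet subtree.parent is now None,
and so is the parent link of every node beneath it.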

Other alternatives: offering ways to detach the subtree, or specifying
that the elements have two defined states (with and without parent/sibling
relationships). Gee... now we're getting into complex APIs for the
client to deal with.

I'm tremendously in favor of the model returned by qp_xml. You get a set
of simple objects that have no methods. They are really just attribute
retainers. Inside these, you have a *Python* list of children, and a
*Python* mapping of attributes. Nothing fancy. Simple and easy.

Note: personally, I believe that the client can operate quite fine
without parent or sibling pointers. If a function needs an element's
parent, then whoever passed the element should pass the parent, too.
From a conceptual level, I am also a bit shaky on an element knowing
anything about its parents or siblings. It would seem that anything
dealing with a particular element should do so in a context-free manner.
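
As a small illustration of "whoever passed the element should pass the
parent, too", here is a sketch with a hypothetical Elem class that keeps
no parent pointer; the traversal supplies the parent as an argument:

```python
class Elem:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)   # plain Python list, elements only

def walk(elem, parent=None):
    """Yield (parent_name, name) pairs; the caller supplies the parent."""
    yield (parent.name if parent is not None else None, elem.name)
    for child in elem.children:
        yield from walk(child, elem)

tree = Elem("propstat", [Elem("prop", [Elem("getetag")]), Elem("status")])
pairs = list(walk(tree))
```

No cycles are created, and any function that needs the context simply
receives it from its caller.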

Note 2: if you really need parents/siblings (i.e. it is difficult to
structure your app to avoid them), then you can always fall back to the
DOM.

Okay... now a couple other issues:

* processing instructions.  (thanx Paul for the links)

I looked at the three specs that Paul linked (didn't need the XML spec..
I knew what they were! :-). Two of them, the DDML and DCD specs, use PIs
only as a means of checking the conformance of a document. The document
can be parsed and handled with or without the PIs.

The third: style sheets. Ick. The PI contains actual data, rather than
conformance issues. I note that a Rationale has been appended to the
spec. I bet that was added because the PI is used for more than document
processing (i.e. it alters semantics).

A minimal approach to PIs might be to collect *only* the PIs that occur
in the prolog into a list. Since the xml-stylesheet PI can only occur in
the prolog, this approach would pick them up. (not that I like it though
:-)
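
One way this could look, sketched with the standard xml.parsers.expat
module. The function name prolog_pis and the sample document are
assumptions made for illustration:

```python
from xml.parsers import expat

def prolog_pis(xml_text):
    """Return (target, data) for each PI seen before the root element."""
    pis = []
    seen_root = False

    def on_pi(target, data):
        if not seen_root:
            pis.append((target, data))

    def on_start(name, attrs):
        nonlocal seen_root
        seen_root = True   # anything after this point is outside the prolog

    parser = expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_pi
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return pis

DOC = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet href="style.css" type="text/css"?>\n'
    '<doc><?ignored inside the body?></doc>'
)
result = prolog_pis(DOC)
```

The XML declaration is not reported as a PI, and the PI inside <doc> is
dropped because the root element has already started.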

* note to Paul: the code you posted is broken :-). You apply the default
namespace to attributes that have no prefix. The XML Namespaces spec
states that no prefix on an attribute means "no namespace". You also
fail to distinguish between "no namespace" in the original state of
beginning to parse, and when somebody resets it using xmlns="". In
addition, you reset the default namespace to "no namespace" inside each
startElement.
[ and a Q: why do you have the "xmlns" prefix defined in startElement? ]
[ design comment: I don't think you want to retain prefixes... if
clients believe they can use a prefix that you provide, then problems
will develop *very* quickly. If the client isn't careful, they could end
up with conflicting prefixes. trust me on this one... mod_dav has a
*bitch* of a time dealing with namespace prefixes. I highly recommend
that you drop them; similarly, I believe you should filter the xmlns*
attributes. ]
[ design comment: you should probably index your attributes by (uri,
local) rather than prefix. the client does not know the prefixes ahead
of time, so they will be unable to fetch the attributes. ]
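
To make the (uri, local) suggestion concrete, here is a sketch that
leans on expat's own namespace processing rather than on the code under
discussion. The helper names and the sample document are assumptions:

```python
from xml.parsers import expat

def split_name(qualified):
    """Split expat's 'uri local' form into (uri, local); uri is None if absent."""
    if " " in qualified:
        uri, local = qualified.split(" ", 1)
        return uri, local
    return None, qualified

def parse_names(xml_text):
    """Return [((uri, local), {(uri, local): value})] in document order."""
    found = []

    def on_start(name, attrs):
        keyed = {split_name(key): value for key, value in attrs.items()}
        found.append((split_name(name), keyed))

    # With a separator set, expat resolves prefixes and consumes xmlns* itself.
    parser = expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return found

DOC = (
    '<D:prop xmlns:D="DAV:" xmlns="http://example.com/default" '
    'depth="1" D:flag="yes"><child/><reset xmlns=""/></D:prop>'
)
elements = parse_names(DOC)
```

The unprefixed depth attribute comes back with no namespace even though
a default namespace is in scope, <child> picks up the default, and
<reset xmlns=""> returns to no namespace. The client never sees a prefix.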

* comments on loss of information (also on "why not use SAX?")

The tree form is very useful. Without it, an application would need
to implement a state machine to effectively process parsed XML. Seeing a
<prop> element means nothing in itself. When you post-process the tree
and step down thru the tree, the parent <propstat> will place you into
the proper state.
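
For contrast, a sketch of the bookkeeping an event-driven client has to
carry, using the standard xml.sax module. The handler class and the
sample response body are illustrative assumptions:

```python
import xml.sax

class PropHandler(xml.sax.ContentHandler):
    """Collect names of elements found directly under <propstat>/<prop>."""
    def __init__(self):
        super().__init__()
        self.path = []     # hand-maintained state: names of open elements
        self.props = []

    def startElement(self, name, attrs):
        # A <prop> means nothing alone; only the recorded path gives context.
        if self.path[-2:] == ["propstat", "prop"]:
            self.props.append(name)
        self.path.append(name)

    def endElement(self, name):
        self.path.pop()

BODY = (
    b"<response><propstat><prop><getetag/><getcontenttype/></prop>"
    b"<status/></propstat></response>"
)
handler = PropHandler()
xml.sax.parseString(BODY, handler)
```

The path list is the state machine in miniature. With a tree, the same
context comes for free from the order in which the client descends.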

For programmers/clients, the tree model is also very handy. It exists
*outside* of the parsing event. Clients may not be able to structure
their responses to the input to be part of the parsing event stream.

Regarding loss of info: for many applications, the client only needs to
know the contents. The finer details of the document structure are
pointless. These applications are typically using XML as a data transfer
mechanism, rather than a layout mechanism. DAV and XML-RPC are two
examples. PIs and comments are not useful.

I'll send some individual replies to the other emails. This email,
however, is my overall summary and argument against DOM-like APIs.

I maintain that an API such as that provided by qp_xml is very useful
for a particular class of applications. Further, I maintain that it
would be a Good Thing to include qp_xml (under whatever name and with
whatever API/code tweaks) in the XML distribution.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/