[Expat-discuss] DOM parser, expat vs. libxml?

Wed Mar 5 20:26:04 EST 2003

On  4 Mar, AFish at GoldenGate.com wrote:
> We need a DOM parser in C that will compile on any platform. So far, the
> only C xml parsers I have seen are expat and libxml. The only DOM parser

Don't forget rxp, a good, fast, compliant and optionally validating
parser: http://www.cogsci.ed.ac.uk/~richard/rxp.html (don't be put off
by the artless home page, it's a good product). If C++ is also OK
for you, there's of course also xerces-c++ (http://xml.apache.org).

Well, and for completeness' sake, don't forget msxml (the XML parser
out of the evil empire). I'm not a fan of MS for various reasons, but
their XML parser (and their XSLT engine) isn't bad.

> build on top of expat I have seen is 'SCEW' the simple C expat wrapper
> (http://www.nongnu.org/scew/).
> 
> Questions:
> 1. Are there other expat wrappers or examples which provide DOM-like xml
> tree traversal?

Sablotron (http://www.gingerall.com/charlie/ga/xml/p_sab.xml). From
the home page:

" Sablotron is a fast, compact and portable XML toolkit implementing
XSLT 1.0, DOM Level2 and XPath 1.0. [...] Sablotron uses James Clark's
expat XML parser."

You shouldn't buy their claim that they have a "fast" XSLT processor,
though (that claim is somewhat ridiculous). Note also that Sablotron
itself is written in C++, not C.

There is certainly more than one DOM implementation based on expat
around. For example, there's a Tcl extension (I'm one of the
maintainers) which implements DOM on top of expat (and also XPath and
XSLT): http://www.tdom.org. The DOM building parts are completely in
C, so it may be worth a look.
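
To give an idea of what such a DOM-building layer on top of expat
amounts to, here is a minimal sketch. It is not how SCEW, Sablotron or
tDOM do it; the names Node, Builder, on_start, on_end and free_tree are
made up for illustration. The two callbacks have the shape of expat's
element handlers (assuming XML_Char is plain char), so they could be
registered with XML_SetElementHandler after handing the Builder to
XML_SetUserData. Attributes and character data are ignored here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical element-only tree node; not taken from any real library. */
typedef struct Node {
    char *name;
    struct Node *parent;
    struct Node *first_child;
    struct Node *last_child;
    struct Node *next_sibling;
} Node;

/* Builder state that would be passed to the parser as user data. */
typedef struct Builder {
    Node *root;     /* document element, NULL until the first start tag */
    Node *current;  /* innermost element that is still open */
    int failed;     /* set to 1 if an allocation failed */
} Builder;

/* Same shape as expat's start-element handler; atts is unused here. */
static void on_start(void *user_data, const char *name, const char **atts)
{
    Builder *b = user_data;
    Node *n;
    size_t len;

    (void)atts;
    assert(b != NULL && name != NULL);
    if (b->failed) return;

    n = calloc(1, sizeof *n);
    if (n == NULL) { b->failed = 1; return; }
    len = strlen(name) + 1;
    n->name = malloc(len);
    if (n->name == NULL) { free(n); b->failed = 1; return; }
    memcpy(n->name, name, len);

    /* Append the new node as the last child of the open element. */
    n->parent = b->current;
    if (b->current == NULL) {
        b->root = n;
    } else if (b->current->last_child == NULL) {
        b->current->first_child = b->current->last_child = n;
    } else {
        b->current->last_child->next_sibling = n;
        b->current->last_child = n;
    }
    b->current = n;
}

/* Same shape as expat's end-element handler: close the open element. */
static void on_end(void *user_data, const char *name)
{
    Builder *b = user_data;

    (void)name;
    assert(b != NULL);
    if (!b->failed && b->current != NULL)
        b->current = b->current->parent;
}

/* Free a node, its following siblings and all their descendants. */
static void free_tree(Node *n)
{
    while (n != NULL) {
        Node *next = n->next_sibling;
        free_tree(n->first_child);
        free(n->name);
        free(n);
        n = next;
    }
}
```

After the parse, the root member of the Builder holds the document
element and the tree can be walked via first_child and next_sibling.
Everything a real DOM needs beyond that (attributes, text, comments,
namespaces, the DOM API itself) is where the actual work is.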

> 2. Has anyone done a side-by-side libxml vs. expat comparison? Is there any
> reason we should roll our own DOM parser on top of expat instead of using
> libxml?

A lot could be said here, but your question is a bit vague about your
needs.

Expat does not validate (although it does read external entities on
demand). If a well-formedness parser is OK for you, expat is
definitely somewhat faster. But since both parsers are really fast,
this may only be of interest if you aim for maximum speed.
Additionally, the time needed to build a DOM-like structure in
memory (which typically needs a lot of mallocs for the node
structures) isn't negligible, so the overall speed depends not only on
the raw parser speed, but also on the quality of the DOM building
code.
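
One common way to cut down on the number of mallocs is to allocate the
node structures in large blocks and hand them out one by one. The
following is only a sketch of that idea, not the allocator of libxml,
tDOM or any other implementation; the node layout, the names and the
block size of 1024 are arbitrary assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node layout, only here to have something to allocate. */
typedef struct Node {
    const char *name;
    struct Node *parent, *first_child, *next_sibling;
} Node;

enum { NODES_PER_BLOCK = 1024 };   /* arbitrary choice for the sketch */

typedef struct Block {
    struct Block *next;            /* previously allocated block */
    size_t used;                   /* nodes already handed out */
    Node nodes[NODES_PER_BLOCK];
} Block;

typedef struct NodePool {
    Block *head;                   /* most recently allocated block */
    size_t malloc_calls;           /* counted for illustration only */
} NodePool;

/* Return a zero-initialized node, calling malloc only once per block. */
static Node *pool_new_node(NodePool *p)
{
    Node *n;

    assert(p != NULL);
    if (p->head == NULL || p->head->used == NODES_PER_BLOCK) {
        Block *blk = malloc(sizeof *blk);
        if (blk == NULL) return NULL;
        blk->next = p->head;
        blk->used = 0;
        p->head = blk;
        p->malloc_calls++;
    }
    n = &p->head->nodes[p->head->used++];
    n->name = NULL;
    n->parent = n->first_child = n->next_sibling = NULL;
    return n;
}

/* Release every block, and with it every node, in one go. */
static void pool_free_all(NodePool *p)
{
    assert(p != NULL);
    while (p->head != NULL) {
        Block *next = p->head->next;
        free(p->head);
        p->head = next;
    }
}
```

The price is that single nodes can't be given back to the system, only
the whole pool at once. For a tree that is built, used and thrown away
that is often acceptable; for a tree that is edited heavily it may not
be.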

Another factor which may be important (depending on the size of your
XML data) is that DOM trees typically need _a lot_ of memory. This
depends of course on how much markup you have in your document (and
how much 'indentation' fluff), but it's normal to need 3 to 5 times
the file size in memory for the DOM tree. Although libxml's DOM trees
need notably less memory than every Java DOM implementation I know,
it isn't the slimmest implementation available. For example, the
above-mentioned tDOM implementation has a notably lower overhead
(which is important for me, because I have to handle really large
product data lists in XML).
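
The 'indentation' fluff ends up in the tree as text nodes which contain
nothing but whitespace. A tree builder which doesn't need them can
simply not create them. A minimal sketch of the check, with a made-up
helper name; the pointer-plus-length arguments are modeled on the way
expat passes character data to its handler, which is not
NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the len bytes at s consist only of XML whitespace
   (space, tab, CR, LF), 0 otherwise. An empty run counts as whitespace. */
static int is_xml_whitespace_only(const char *s, size_t len)
{
    size_t i;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        switch (s[i]) {
        case ' ': case '\t': case '\r': case '\n':
            break;          /* still whitespace, keep looking */
        default:
            return 0;       /* found real content */
        }
    }
    return 1;
}
```

Two caveats: expat may deliver one run of text in several pieces, so a
real builder would run the check on the accumulated text and not on
each piece, and dropping whitespace is only safe where it isn't
significant (mixed content, xml:space="preserve").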

DOM is not always the same DOM. Do you mean DOM Level 1, 2 or 3? What
about entities? Must you preserve parsed entities? DOM alone will
probably make you somewhat unhappy before long. Navigation within the
tree can get tedious if you don't have support for at least XPath
(libxml provides this). But I'd better stop now.
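
To illustrate the navigation point: without XPath you end up writing
small helpers like the following over and over. This is a sketch over a
made-up, minimal node structure, not the API of any DOM library, and
find_first is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal element node, just enough to navigate. */
typedef struct Node {
    const char *name;
    struct Node *first_child;
    struct Node *next_sibling;
} Node;

/* Follow a '/'-separated list of child element names starting at ctx,
   always taking the first child whose name matches. Returns NULL as
   soon as a step has no match. An empty path returns ctx itself. */
static Node *find_first(Node *ctx, const char *path)
{
    assert(path != NULL);
    while (ctx != NULL && *path != '\0') {
        size_t len = strcspn(path, "/");   /* length of this step */
        Node *child = ctx->first_child;

        while (child != NULL &&
               !(strlen(child->name) == len &&
                 strncmp(child->name, path, len) == 0))
            child = child->next_sibling;

        ctx = child;
        path += len;
        if (*path == '/') path++;          /* skip the separator */
    }
    return ctx;
}
```

Note how little this does compared to XPath: no predicates, no axes, no
node sets, and no backtracking (if the first matching child has no
matching grandchild, later siblings are never tried). Each of these is
one more helper to write and test by hand.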

rolf
