[XML-SIG] Re: Can anyone recommend a sensible XML parser for Python?

Uche Ogbuji uche.ogbuji@fourthought.com
Mon, 09 Sep 2002 13:39:01 -0600

> > According to the DOM Level 2 spec: "And, cloning Document, DocumentType,
> > Entity, and Notation nodes is implementation dependent."
> This is why standards compliance is not terribly important to me.  I would
> rather have a useful XML API than a standardized one.

Well, what do you think is the most useful behavior of cloning a document?  Is 
it the one I posted in response to this thread?  If so, don't you think the element 
of surprise is too great (I'd be surprised myself at that behavior)?

Wouldn't it be better for Python/XML to offer a *separate*, specialized 
function for cloning nodes, rather than doing weird things with cloneNode?

> > Can you expand a bit more on the actual use case that makes you think you want 
> > to clone a document node?
> I have a template "frame" document.  I want to clone the document, populate it
> with information lifted from other XML files, and then write the resultant
> (cloned) document out.  This is the very first use-case I ever had working with
> XML and it is still the most common.

I see.  It sounds as if a general document duplication function would be of 
use to you.  I agree that this would be useful.  I'm willing to write one and 
add it to xml.dom.ext.

But I don't think this is a use case for node.cloneNode.
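
A minimal sketch of what such a duplication function might look like, assuming the standard library's xml.dom.minidom rather than the PyXML/4Suite DOM discussed here. The name duplicate_document is hypothetical and is not an existing xml.dom.ext function. It builds a fresh, empty Document and imports each top-level node into it with importNode, skipping the doctype (which minidom refuses to import), so cloneNode is never called on the Document node itself.

```python
from xml.dom import Node
from xml.dom.minidom import getDOMImplementation, parseString


def duplicate_document(source):
    """Return a new Document holding deep copies of source's top-level nodes.

    Hypothetical helper for illustration; the doctype is not copied.
    """
    # createDocument(None, None, None) yields a Document with no root element.
    copy = getDOMImplementation().createDocument(None, None, None)
    for child in source.childNodes:
        if child.nodeType == Node.DOCUMENT_TYPE_NODE:
            continue  # minidom's importNode rejects doctype nodes
        copy.appendChild(copy.importNode(child, True))
    return copy


# Usage: copy a template "frame" document, then populate only the copy.
template = parseString("<frame><title/><body/></frame>")
page = duplicate_document(template)
title = page.getElementsByTagName("title")[0]
title.appendChild(page.createTextNode("Hello"))
```

The template stays untouched, so it can be duplicated again for the next output document.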

> > We choose not to allow it.  Perfectly legal, and I think this is the right 
> > choice.
> Yes, but the point remains that this *used* to work, and now it *doesn't*.

I don't remember.  What did it do when it "worked"?

> This is functionality I found useful.  While I can't comment on the intrinsic
> sense or nonsense of cloning document nodes in DOM, I do know that it's
> difficult to keep track of when features like this appear and disappear in the
> various different XML solutions for Python.

Was it ever documented?  Every software module has undocumented "features" 
that you use at your peril.  I don't think it's fair to complain when these 
appear and disappear.

Then again, the poor state of PyXML documentation in general weakens that 
point of mine, doesn't it?  Ah well.

> Maybe this is the only feature that has done this; I don't know.  It just
> happens that it's a very commonly-used one for me.
> This is just another instance of my general complaint that tracking versioning
> dependencies is not worth the effort for my degenerately simple use-cases for
> XML.
> > You mean you can't require, say PyXML 0.8.1?  Tough crowd you develop for?
> > :-)
> There are still some parties interested in Twisted who are upset that it
> requires Python 2.1; in fact, I felt guilty doing 2.1 support because I am
> likely going to have to backport portions of it to 1.5.2 for some people.  We
> can all thank Red Hat for this inane persistence of ancient python versions,
> but it is sadly the world I live in.

I sympathize.  It's largely because of Red Hat that it took us so long to drop 
1.5 support in 4Suite.  But a couple of months ago we decided it is not worth 
the development and support overhead and ditched support for all versions 
before 2.1.  I sleep better since then  :-)

> > > My main frustration is with packaging.
> > Here you have a point.  Python, PyXML, and a lot of the related packages move
> > very quickly, and so quickly that they cause all manner of packaging
> > problems.
> This is my main point, and this is the one that the PyXML community can do the
> least to address.  Buggy and idiosyncratic implementations are already in the
> wild, and some apps will depend on those particular bugs and idiosyncrasies.
> If twisted depends on a new or different set of bugs and quirks, I make it
> incompatible with whatever other XML-using applications are out there today.
> Given that XML is an integration technology this is certainly less than
> desirable.
> > There is no easy solution to this.
> Having a project that is precipitously approaching 1.0 myself, I can
> sympathize.  As much as this sort of dependency and compatibility problem has
> bothered me, I *know* there will be people that write apps for Twisted and will
> curse my name when I enhance some functionality later on :-).
> > I have had it in mind to suggest a PyXML-in-a-tie type effort in the Python
> > Business Forum once the effort on Python itself starts to gain legs.  I guess
> > I can count on you to at least help cheerlead?  :-)
> Cheerleading, certainly :-).  Although I'm less interested in seeing PyXML
> prepared for "business" clients and more interested in just seeing the level of
> QA on the volunteer work go up.  If I *had* any spare "scarce resources" to
> commit beyond my own projects, I would certainly help getting the unit tests
> unified and automated.
> > > or produce what amounts to my own `implementation' of an XML parser.
> > 
> > If you try going this route, I guarantee you'll still be trying to get the 
> > most basic things right six months from now.
> ...
> > > For the applications that I'm intending to write, just doing my own parser and
> > > API is both more appealing and more rewarding.
> > 
> > Really?  Color me deeply skeptical.  I have not seen an application on earth 
> > where implementing one's own parser is a good idea, and precious few where 
> > implementing one's own API is a good idea.  I have a lot of colleagues who 
> > have tried.
> While it is *possible* that I'm smarter than you think I am, it is certain that
> I'm more stubborn.

I think you take the wrong gloss on my words.  I think Linus Torvalds himself 
would take years to write a complete and correct XML parser.  It's the nature 
of the beast (XML), not the programmer.

I certainly do not consider myself smart enough to take on that dragon.  I'm 
just glad to lean on folk like Clark (and Drake, Evans and co), Garshol and 

> My sophomoric attempt at an XML parser is now in Twisted
> CVS.

Interesting.  So how did you test it?

> I've had this objection raised over writing yet another web server, yet
> another remote procedure call protocol, yet another asynchronous socket server
> and yet another database interface.  It seems like at least some of these ideas
> were good ones, so I went ahead and wrote an XML parser and representation
> anyway :-).

I would rather write a Web server, another RPC, another async socket server 
*and* another DBMS interface all in a row than just take on the single task of 
writing an XML parser.  And I think I can speak authoritatively, because I 
*have* implemented all four of those things.

> As a data point for this hypothesis, writing the parser and the node tree took
> me less than half as much time as writing these posts to various mailing lists
> about XML tools (not counting this post, which has been the most
> time-consuming): it took less than a quarter as much time as attempting (and
> failing) to track down bugs in PyXML, not counting the time I spent trying to
> figure out how to turn off undesired features in a way that would work on more
> than one version.  My two main existing PyXML-using applications are already
> ported to this, changing barely any of their code.

As I said, I am very skeptical of the result.  I'll be impressed when you tell 
me your home-brew XML parser passes the OASIS conformance suite.
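
On the testing question, a minimal sketch of the shape a conformance-style harness could take: feed a parser documents of known well-formedness and report where it disagrees. The four cases are hand-written stand-ins, not files from the OASIS suite, and check_wellformedness is a hypothetical name. xml.dom.minidom.parseString stands in here for the parser under test.

```python
from xml.dom.minidom import parseString

# (label, document, is_well_formed): hand-written examples, NOT OASIS suite files.
CASES = [
    ("nested elements", "<a><b/></a>", True),
    ("mismatched end tag", "<a><b></a>", False),
    ("two root elements", "<a/><b/>", False),
    ("unquoted attribute value", "<a x=1/>", False),
]


def check_wellformedness(parse, cases):
    """Return labels of cases where parse() accepts or rejects wrongly."""
    failures = []
    for label, document, is_well_formed in cases:
        try:
            parse(document)
            accepted = True
        except Exception:  # any parser error counts as a rejection
            accepted = False
        if accepted != is_well_formed:
            failures.append(label)
    return failures


failures = check_wellformedness(parseString, CASES)
```

An empty list means the parser agreed with every case. A real run would load the suite's own test files and expected outcomes instead of this short list.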

Anyway, this is all moot argument.  It looks as if you've satisfied yourself 
for now.

Good luck.

Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Track chair, XML/Web Services One Boston: http://www.xmlconference.com/