[XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python?

Uche Ogbuji uche.ogbuji@fourthought.com
Sat, 07 Sep 2002 00:10:51 -0600

> On Fri, 06 Sep 2002 17:31:51 -0600, Uche Ogbuji <uche.ogbuji@fourthought.com> wrote:
> > BTW, from what you're describing above, you are *not* parsing XHTML.  If it 
> > violates the DTD, it is not XHTML.  Period.
> > Just say you're parsing "HTML" and don't mention a version.  That's the only 
> > way to say it correctly  ;-)
> OK, "XML which browsers will render".  I am not parsing HTML, in that I won't
> accept XML that is not well-formed.  I suppose I could try to wrap HtmlParser
> with minidom... yuck.  Gross, but probably a good idea, come to think of it :)

I can't imagine why this would be gross.  IMO, it illustrates very admirable 
technique, and one of the strengths of Python/XML.  The parsing mechanism and 
the generated representation are independent of each other, so you can mix 
them and match them in order to take advantage of the most needed features of 
each.

We put a lot of work into making this possible, and I find it very elegant.  
C++ folks took ages before they cottoned on to such an approach (in the STL), 
and now it has them in raptures (generic programming is all the rage).  Of 
course, old strait-jacket Java can't touch this.  Too bad for them.
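
For what it's worth, here is a minimal sketch of that mix-and-match approach. It assumes a modern Python 3 standard library rather than the 2.2-era HtmlParser mentioned above: html.parser.HTMLParser supplies the lenient parse events and xml.dom.minidom supplies the tree. The names MinidomBuilder and html_to_dom are made up for illustration, and the sketch only copes with a single root element.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document


class MinidomBuilder(HTMLParser):
    """Hypothetical helper: feed lenient HTML parse events into a minidom tree."""

    def __init__(self):
        super().__init__()
        self.document = Document()
        # Stack of open nodes; the Document itself sits at the bottom.
        self._stack = [self.document]

    def handle_starttag(self, tag, attrs):
        element = self.document.createElement(tag)
        for name, value in attrs:
            # Valueless attributes arrive as None; store an empty string.
            element.setAttribute(name, value or "")
        self._stack[-1].appendChild(element)
        self._stack.append(element)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag: open it, then close it straight away.
        self.handle_starttag(tag, attrs)
        self._stack.pop()

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, along with anything
        # left unclosed inside it.  Stray end tags are ignored.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tagName == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # A Document cannot hold text directly, so skip top-level text.
        if len(self._stack) > 1:
            self._stack[-1].appendChild(self.document.createTextNode(data))


def html_to_dom(text):
    """Parse sloppy markup and return an xml.dom.minidom Document."""
    builder = MinidomBuilder()
    builder.feed(text)
    builder.close()
    return builder.document
```

Calling html_to_dom('<p class=x>Hello <b>world</p>') gives back an ordinary minidom Document, so the usual DOM calls (getElementsByTagName, toxml, and so on) work on markup that a strict XML parser would reject.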

> The reason I mentioned the cloneNode bug is because it is the most reliable and
> the most trivial to demonstrate.  Like I said; at some point, I will clean up
> my complaints and submit some bug reports.  Here's a "bare test case" of that
> particular spurious accusation:
>     glyph@zelda:~% python
>     Python 2.2.1 (#1, Aug 30 2002, 09:36:47) 
>     [GCC 2.95.4 20011002 (Debian prerelease)] on linux2
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> from xml.dom.minidom import parseString
>     >>> parseString("<hello_world/>").cloneNode(1)
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>       File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 186, in cloneNode
>         clone = _clone_node(self, deep, self.ownerDocument or self)
>       File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 1248, in _clone_node
>         elif node.nodeType == PROCESSING_INSTRUCTION_NODE:
>     NameError: global name 'PROCESSING_INSTRUCTION_NODE' is not defined

You see, this is why reporting such "bugs" early is helpful.  I could have 
told you ages ago that it is a *bad* idea to call cloneNode on a Document 
node.

According to the DOM Level 2 spec:

"And, cloning Document, DocumentType, Entity, and Notation nodes is 
implementation dependent."

IOW, yer gets what yer gets and can't really complain  :-)

Can you expand a bit more on the actual use case that makes you think you want 
to clone a document node?
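
If the goal is simply an independent copy of the content, one portable way to get it without cloning the Document node itself is to import the root element into a fresh document. This is only a sketch, written against the minidom in a modern Python 3 standard library; copy_document is a made-up name, and it deliberately ignores any doctype, comments or processing instructions outside the root element.

```python
from xml.dom.minidom import Document, parseString


def copy_document(source):
    """Hypothetical helper: return a new Document holding a deep copy of source's root."""
    copy = Document()
    # importNode makes a deep copy owned by the new document;
    # the source tree is left untouched.
    copy.appendChild(copy.importNode(source.documentElement, True))
    return copy


original = parseString("<hello_world><child/></hello_world>")
duplicate = copy_document(original)
```

This relies only on importNode for an Element, so it does not depend on how a given implementation treats cloneNode on a Document.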

I do agree that the confused error message is a glitch.  Current PyXML CVS 
gives a more straightforward "sod off"  :-)

>>> from xml.dom.minidom import parseString
>>> parseString("<hello_world/>").cloneNode(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/home/uogbuji/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 198, in cloneNode
    clone = _clone_node(self, deep, self.ownerDocument or self)
  File "/home/uogbuji/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 1454, in _clone_node
    raise Exception("Cannot clone node %s" % repr(node))
Exception: Cannot clone node <xml.dom.minidom.Document instance at 0x82e9cbc>

We choose not to allow it.  Perfectly legal, and I think this is the right 

> In order to do the work I want to do, though, those bug reports aren't going to
> help.  Even if you resolved every bug report that I submitted within a week, I
> would be stuck in the same place I am now: I have to work around the bugs in a
> bunch of old versions of PyXML

You mean you can't require, say PyXML 0.8.1?  Tough crowd you develop for?  :-)

> or produce what amounts to my own `implementation' of an XML parser. 

If you try going this route, I guarantee you'll still be trying to get the 
most basic things right six months from now.

> Granted, if I packaged a newer, fixed-up
> version of PyXML with Twisted, I wouldn't have to be mucking about with bits
> and bytes -- but I *would* have to understand the entire ontology of confusion
> associated with cross-language XML APIs.
> My main frustration is with packaging.  If all the world were running Debian
> unstable, I'd be fine: I'd just say Depends: python2.2-xml >= 0.9.  However,
> with lots of users in Windows, and many more on other linux platforms with less
> pleasant package management, every new package that Twisted requires is another
> fifteen minutes that the software takes to get running.  It's already confusing
> enough to understand it when it *works*; I want the process of getting it
> running to be as seamless as possible :).

Here you have a point.  Python, PyXML, and a lot of the related packages move 
very quickly, so quickly that they cause all manner of packaging problems.

There is no easy solution to this.  Python is much more of a volunteer 
community than, say, Java.  People work on Python and PyXML mostly to scratch 
their itches, which means they have less incentive to worry about the 
packaging mess they leave behind.

This is the impetus for the Python-in-a-tie effort for Python proper.  I do 
think we'd make a lot more friends if there were a matching PyXML-in-a-tie.  
It would mean companies would have to commit scarce resources to freezing 
interfaces and then testing and packaging to oblivion.

I have had it in mind to suggest a PyXML-in-a-tie type effort in the Python 
Business Forum once the effort on Python itself starts to gain legs.  I guess 
I can count on you to at least help cheerlead?  :-)

> For the applications that I'm intending to write, just doing my own parser and
> API is both more appealing and more rewarding.

Really?  Color me deeply skeptical.  I have not seen an application on earth 
where implementing one's own parser is a good idea, and precious few where 
implementing one's own API is a good idea.  I have a lot of colleagues who 
have tried.

By all means, if you'd like to try, go ahead.

> Neither DOM nor SAX will
> present an API which allows me to get network XML events in quite the way I
> want, so I'm going to have to do some wrapping.

I have learned through my own bitter experience that you do not want network 
interfaces to have *anything* to do with the lexical XML layer (or even 
Infoset).  It is best to design network interactions around *application* 
level semantics.  Basically sending around chunks of XML text is far less 
hazardous than what I think you mean.
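
To make that concrete, here is a rough sketch of one way to do it, under assumptions that are not from this thread: whole chunks of XML text travel with a 4-byte length prefix, each chunk is parsed only once it is complete, and the rest of the program sees plain application-level values rather than parser events. The <chat> message format and the names ChunkFramer, frame and to_message are all invented for the example.

```python
import struct
from xml.dom.minidom import parseString


class ChunkFramer:
    """Hypothetical helper: collect length-prefixed chunks from a byte stream."""

    def __init__(self):
        self._buffer = b""

    def feed(self, data):
        """Add received bytes and return the list of chunks now complete."""
        self._buffer += data
        chunks = []
        while len(self._buffer) >= 4:
            (size,) = struct.unpack("!I", self._buffer[:4])
            if len(self._buffer) < 4 + size:
                break  # the rest of this chunk has not arrived yet
            chunks.append(self._buffer[4:4 + size])
            self._buffer = self._buffer[4 + size:]
        return chunks


def frame(xml_text):
    """Prefix one XML document with its encoded length (assumed wire format)."""
    payload = xml_text.encode("utf-8")
    return struct.pack("!I", len(payload)) + payload


def to_message(chunk):
    """Parse one whole chunk and hand back application-level values only."""
    root = parseString(chunk).documentElement
    text = "".join(
        node.data for node in root.childNodes if node.nodeType == node.TEXT_NODE
    )
    return {"kind": root.tagName, "sender": root.getAttribute("from"), "text": text}


# Simulate two messages arriving split across two network reads.
wire = frame('<chat from="glyph">hi</chat>') + frame('<chat from="uche">yo</chat>')
framer = ChunkFramer()
first_read = framer.feed(wire[:10])
second_read = framer.feed(wire[10:])
messages = [to_message(chunk) for chunk in second_read]
```

The network code never looks inside a chunk, and the application code never sees a partial document, so the parser can be swapped without touching either side.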

> (I do wish pyRXP were
> event-based... it's very close, in spirit, to what I want.)  If the general
> quality of XML parsers in Python were really high, I would regard this impulse
> as contrary and counterproductive -- why write my own library for doing this
> when perfectly good ones already exist and are deployed all over the place?

Well, as I said, I don't see any evidence that the quality of XML parsers in 
Python is not high.  You pointed out one problem in cloneNode which, from what 
I gather, was mostly because you're abusing DOM.  This had nothing to do with 
parsing.  Are you speaking generically?

> So maybe I'm just rationalizing what I would have done anyway.  Nevertheless,
> it is easier to write my own XML parser than to even properly report the bugs
> that I have thus far discovered.

I find this claim ludicrous on its face.  Writing an XML parser with the 
compliance level and quality of any of the ones in PyXML takes years.  Yes.  

Feel free to re-learn this fact the hard way, if you wish.

> For more information on my perception of XML, and why my requirements are as
> stripped-down as they are, look at the presentation here:
>     http://xmlsucks.org/but_you_have_to_use_it_anyway/
> (Yes, it's a real URL, and it's not mine.)

Yes.  I'd guess we've all seen that link.  <shrug>  So what useful technology 
doesn't suck?  XML works for me.  Your mileage may vary.

Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Track chair, XML/Web Services One Boston: http://www.xmlconference.com/
Basic XML and RDF techniques for knowledge management, Part 7 - 
Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/libra
Python and XML development using 4Suite, Part 3: 4RDF -