New subject: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python?

Sept. 7, 2002

On Fri, 06 Sep 2002 17:31:51 -0600, Uche Ogbuji <uche.ogbuji@fourthought.com> wrote:
...
BTW, from what you're describing above, you are *not* parsing XHTML.  If it 
violates the DTD, it is not XHTML.  Period.
...
Just say you're parsing "HTML" and don't mention a version.  That's the only 
way to say it correctly  ;-)
OK, "XML which browsers will render".  I am not parsing HTML, in that I won't
accept XML that is not well-formed.  I suppose I could try to wrap HtmlParser
with minidom... yuck.  Gross, but probably a good idea, come to think of it :)
I can't imagine why this would be gross.  IMO, it illustrates a very admirable 
technique, and one of the strengths of Python/XML.  The parsing mechanism and 
the generated representation are independent of each other, so you can mix 
them and match them in order to take advantage of the most needed features of 
either.
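
A minimal sketch of that mix-and-match idea, assuming a modern Python 3 standard library rather than the Python 2.2 and PyXML of this thread: the lenient `html.parser.HTMLParser` does the parsing, and `xml.dom.minidom` supplies the tree.  `MinidomBuilder`, `html_to_minidom` and the `VOID_TAGS` list are hypothetical names made up for illustration, not part of either module.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document

# Elements that never take a closing tag in HTML (partial list, illustration only).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class MinidomBuilder(HTMLParser):
    """Hypothetical helper: turn HTMLParser events into a minidom tree."""

    def __init__(self):
        super().__init__()
        self.document = Document()
        self._stack = [self.document]  # currently open nodes, innermost last

    def handle_starttag(self, tag, attrs):
        element = self.document.createElement(tag)
        for name, value in attrs:
            # Valueless attributes such as "checked" arrive with value None.
            element.setAttribute(name, value if value is not None else name)
        self._stack[-1].appendChild(element)
        if tag not in VOID_TAGS:
            self._stack.append(element)

    def handle_endtag(self, tag):
        # Close everything back to the nearest matching open element;
        # stray end tags with no matching open element are ignored.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].tagName == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        if len(self._stack) > 1:  # skip text outside the root element
            self._stack[-1].appendChild(self.document.createTextNode(data))


def html_to_minidom(text):
    builder = MinidomBuilder()
    builder.feed(text)
    builder.close()
    return builder.document


doc = html_to_minidom("<html><body><p>Hello<br>world</body></html>")
print(doc.documentElement.toxml())
```

This is only a sketch: input with more than one top-level element would make minidom raise an error, and no attempt is made to reproduce the implied-end-tag rules that real browsers apply.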

We put a lot of work into making this possible, and I find it very elegant.  
C++ folks took ages before they cottoned on to such an approach (in the STL), 
and now it has them in raptures (generic programming is all the rage).  Of 
course, old strait-jacket Java can't touch this.  Too bad for them.
...
The reason I mentioned the cloneNode bug is because it is the most reliable and
the most trivial to demonstrate.  Like I said; at some point, I will clean up
my complaints and submit some bug reports.  Here's a "bare test case" of that
particular spurious accusation:
glyph@zelda:~% python
    Python 2.2.1 (#1, Aug 30 2002, 09:36:47) 
    [GCC 2.95.4 20011002 (Debian prerelease)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from xml.dom.minidom import parseString
    >>> parseString("<hello_world/>").cloneNode(1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 186, in cloneNode
        clone = _clone_node(self, deep, self.ownerDocument or self)
      File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 1248, in _clone_node
        elif node.nodeType == PROCESSING_INSTRUCTION_NODE:
    NameError: global name 'PROCESSING_INSTRUCTION_NODE' is not defined
...
You see, this is why reporting such "bugs" early is helpful.  I could have 
told you ages ago that it is a *bad* idea to call cloneNode on a Document 
object.

According to the DOM Level 2 spec:

"And, cloning Document, DocumentType, Entity, and Notation nodes is 
implementation dependent."

IOW, yer gets what yer gets and can't really complain  :-)

Can you expand a bit more on the actual use case that makes you think you want 
to clone a document node?

I do agree that the confused error message is a glitch.  Current PyXML CVS 
gives a more straightforward "sod off"  :-)

    >>> from xml.dom.minidom import parseString
    >>> parseString("<hello_world/>").cloneNode(1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/home/uogbuji/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 198, in cloneNode
        clone = _clone_node(self, deep, self.ownerDocument or self)
      File "/home/uogbuji/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 1454, in _clone_node
        raise Exception("Cannot clone node %s" % repr(node))
    Exception: Cannot clone node <xml.dom.minidom.Document instance at 0x82e9cbc>

We choose not to allow it.  Perfectly legal, and I think this is the right 
choice.
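
If the goal is simply an independent copy of the content, one portable route is to leave the Document node alone and copy its root element instead.  The sketch below assumes a modern Python 3 `xml.dom.minidom`; `copy_document` is a hypothetical helper name, and it deliberately ignores any doctype, comments or processing instructions that sit outside the root element.

```python
from xml.dom.minidom import Document, parseString


def copy_document(source):
    """Hypothetical helper: build a new Document around a deep copy of the root."""
    target = Document()
    # importNode makes a deep copy that is owned by the target document.
    target.appendChild(target.importNode(source.documentElement, True))
    return target


original = parseString("<hello_world><child/></hello_world>")

# Cloning an Element is well defined by the DOM, unlike cloning a Document.
root_copy = original.documentElement.cloneNode(True)

duplicate = copy_document(original)
print(duplicate.documentElement.toxml())
```

Both copies are detached from the original tree, so changing them leaves `original` untouched.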
...
In order to do the work I want to do, though, those bug reports aren't going to
help.  Even if you resolved every bug report that I submitted within a week, I
would be stuck in the same place I am now: I have to work around the bugs in a
bunch of old versions of PyXML
You mean you can't require, say, PyXML 0.8.1?  Tough crowd you develop for?  :-)
...
or produce what amounts to my own `implementation' of an XML parser.
If you try going this route, I guarantee you'll still be trying to get the 
most basic things right six months from now.
...
Granted, if I packaged a newer, fixed-up
version of PyXML with Twisted, I wouldn't have to be mucking about with bits
and bytes -- but I *would* have to understand the entire ontology of confusion
associated with cross-language XML APIs.
My main frustration is with packaging.  If all the world were running Debian
unstable, I'd be fine: I'd just say Depends: python2.2-xml >= 0.9.  However,
with lots of users in Windows, and many more on other linux platforms with less
pleasant package management, every new package that Twisted requires is another
fifteen minutes that the software takes to get running.  It's already confusing
enough to understand it when it *works*; I want the process of getting it
running to be as seamless as possible :).
Here you have a point.  Python, PyXML, and a lot of the related packages move 
so quickly that they cause all manner of packaging problems.

There is no easy solution to this.  Python is much more of a volunteer 
community than, say, Java.  People work on Python and PyXML mostly to scratch 
their itches, which means they have less incentive to worry about the 
packaging mess they leave behind.

This is the impetus for the Python-in-a-tie effort for Python proper.  I do 
think we'd make a lot more friends if there were a matching PyXML-in-a-tie.  
It would mean companies would have to commit scarce resources to freezing 
interfaces and then testing and packaging to oblivion.

I have had it in mind to suggest a PyXML-in-a-tie type effort in the Python 
Business Forum once the effort on Python itself starts to gain legs.  I guess 
I can count on you to at least help cheerlead?  :-)
...
For the applications that I'm intending to write, just doing my own parser and
API is both more appealing and more rewarding.
Really?  Color me deeply skeptical.  I have not seen an application on earth 
where implementing one's own parser is a good idea, and precious few where 
implementing one's own API is a good idea.  I have a lot of colleagues who 
have tried.

By all means, if you'd like to try, go ahead.
...
Neither DOM nor SAX will
present an API which allows me to get network XML events in quite the way I
want, so I'm going to have to do some wrapping.
I have learned through my own bitter experience that you do not want network 
interfaces to have *anything* to do with the lexical XML layer (or even 
Infoset).  It is best to design network interactions around *application* 
level semantics.  Basically sending around chunks of XML text is far less 
hazardous than what I think you mean.
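
For reference, the event-per-chunk delivery asked for in the quoted text can be approximated with expat's incremental interface, which accepts data in whatever pieces the network hands over.  This is a minimal sketch assuming Python 3's `xml.parsers.expat`; `make_event_parser`, the element names and the chunk boundaries are invented for illustration.

```python
import xml.parsers.expat


def make_event_parser(events):
    """Hypothetical helper: an expat parser that records start/end events."""
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = (
        lambda name, attrs: events.append(("start", name, dict(attrs)))
    )
    parser.EndElementHandler = lambda name: events.append(("end", name))
    return parser


events = []
parser = make_event_parser(events)

# Chunks split at arbitrary points, as they might arrive from a socket.
for chunk in (b"<stream><mess", b"age to='glyph'/>", b"</stream>"):
    parser.Parse(chunk, False)  # False: more data is still expected
parser.Parse(b"", True)         # True: end of input, flush and finish

for event in events:
    print(event)
```

Whether handlers fire as soon as each chunk is fed or only once more data arrives depends on the underlying expat version, so code built on this should not rely on exact timing.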
...
(I do wish pyRXP were
event-based... it's very close, in spirit, to what I want.)  If the general
quality of XML parsers in Python were really high, I would regard this impulse
as contrary and counterproductive -- why write my own library for doing this
when perfectly good ones already exist and are deployed all over the place?
Well, as I said, I don't see any evidence that the quality of XML parsers in 
Python is not high.  You pointed out one problem in cloneNode which, from what 
I gather, was mostly because you're abusing DOM.  This had nothing to do with 
parsing.  Are you speaking generically?
...
So maybe I'm just rationalizing what I would have done anyway.  Nevertheless,
it is easier to write my own XML parser than to even properly report the bugs
that I have thus far discovered.
I find this claim ludicrous on its face.  Writing an XML parser with the 
compliance level and quality of any of the ones in PyXML takes years.  Yes.  
Years.

Feel free to re-learn this fact the hard way, if you wish.
...
For more information on my perception of XML, and why my requirements are as
stripped-down as they are, look at the presentation here:
http://xmlsucks.org/but_you_have_to_use_it_anyway/
(Yes, it's a real URL, and it's not mine.)
Yes.  I'd guess we've all seen that link.  <shrug>  So what useful technology 
doesn't suck?  XML works for me.  Your mileage may vary.

-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Track chair, XML/Web Services One Boston: http://www.xmlconference.com/
Basic XML and RDF techniques for knowledge management, Part 7 - 
http://www-106.ibm.com/developerworks/xml/library/x-think12.html
Keeping pace with James Clark - 
http://www-106.ibm.com/developerworks/xml/library/x-jclark.html
Python and XML development using 4Suite, Part 3: 4RDF - 
http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle...
1EA5A2CF4621C386256BBB006F4CEC
