[XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python?

Glyph Lefkowitz glyph@twistedmatrix.com
Fri, 06 Sep 2002 22:44:42 -0500 (CDT)




On Fri, 06 Sep 2002 17:31:51 -0600, Uche Ogbuji <uche.ogbuji@fourthought.com> wrote:

> BTW, from what you're describing above, you are *not* parsing XHTML.  If it 
> violates the DTD, it is not XHTML.  Period.

> Just say you're parsing "HTML" and don't mention a version.  That's the only 
> way to say it correctly  ;-)

OK, "XML which browsers will render".  I am not parsing HTML, in that I won't
accept XML that is not well-formed.  I suppose I could try to wrap HtmlParser
with minidom... yuck.  Gross, but probably a good idea, come to think of it :)
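
A rough sketch of what that wrapping might look like, written against the
html.parser and xml.dom.minidom modules of a current Python standard library
(an assumption; those module paths are not what shipped in 2002).
MinidomBuilder, html_to_minidom and VOID_TAGS are hypothetical names, and the
void-tag list is deliberately incomplete:

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document

# Elements that HTML allows without a closing tag (partial, illustrative list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class MinidomBuilder(HTMLParser):
    """Hypothetical helper: turn HTMLParser events into a minidom tree."""

    def __init__(self):
        super().__init__()
        self.document = Document()
        self._stack = [self.document]  # currently open nodes, root first

    def handle_starttag(self, tag, attrs):
        element = self.document.createElement(tag)
        for name, value in attrs:
            # Valueless attributes (e.g. "disabled") get their own name as value.
            element.setAttribute(name, value if value is not None else name)
        self._stack[-1].appendChild(element)
        if tag not in VOID_TAGS:
            self._stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].tagName == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        if len(self._stack) > 1:  # a Document cannot hold text directly
            self._stack[-1].appendChild(self.document.createTextNode(data))


def html_to_minidom(text):
    """Parse forgiving HTML-ish text and return a minidom Document."""
    builder = MinidomBuilder()
    builder.feed(text)
    builder.close()
    return builder.document


doc = html_to_minidom("<html><body><p>Hi<br>there</body></html>")
print(doc.documentElement.toxml())
```

Input with more than one top-level element would still fail here, since a
minidom Document accepts only a single document element; a real wrapper would
have to decide what to do about that.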

> Well, no one can tell you what to do with your time, but such general comments 
> are not very useful.  It's not as if you posted 10 bug reports, then threw up 
> your hands and said "I'm blowing this joint".  You made one vague mention of a 
> cloneNode bug, without even a bare test case.

I mentioned the cloneNode bug because it is the most reliable and the most
trivial to demonstrate.  Like I said, at some point I will clean up my
complaints and submit some bug reports.  Here's a "bare test case" of that
particular spurious accusation:

    glyph@zelda:~% python
    Python 2.2.1 (#1, Aug 30 2002, 09:36:47) 
    [GCC 2.95.4 20011002 (Debian prerelease)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from xml.dom.minidom import parseString
    >>> parseString("<hello_world/>").cloneNode(1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 186, in cloneNode
        clone = _clone_node(self, deep, self.ownerDocument or self)
      File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 1248, in _clone_node
        elif node.nodeType == PROCESSING_INSTRUCTION_NODE:
    NameError: global name 'PROCESSING_INSTRUCTION_NODE' is not defined
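
The same check can be written as a small script rather than an interactive
session.  This is a sketch that assumes the minidom bundled with a current
Python standard library, not the _xmlplus package in the traceback; the
assertions describe what a working deep clone of a Document should satisfy,
and say nothing about which PyXML releases pass:

```python
from xml.dom.minidom import parseString

document = parseString("<hello_world/>")
clone = document.cloneNode(True)  # deep copy of the whole Document

# A usable clone is a distinct object with the same serialized content.
assert clone is not document
assert clone.documentElement.tagName == "hello_world"
assert clone.toxml() == document.toxml()
```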

In order to do the work I want to do, though, those bug reports aren't going to
help.  Even if you resolved every bug report that I submitted within a week, I
would be stuck in the same place I am now: I have to work around the bugs in a
bunch of old versions of PyXML or produce what amounts to my own
`implementation' of an XML parser.  Granted, if I packaged a newer, fixed-up
version of PyXML with Twisted, I wouldn't have to be mucking about with bits
and bytes -- but I *would* have to understand the entire ontology of confusion
associated with cross-language XML APIs.

My main frustration is with packaging.  If all the world were running Debian
unstable, I'd be fine: I'd just say Depends: python2.2-xml >= 0.9.  However,
with lots of users on Windows, and many more on other Linux platforms with less
pleasant package management, every new package that Twisted requires is another
fifteen minutes that the software takes to get running.  It's already confusing
enough to understand it when it *works*; I want the process of getting it
running to be as seamless as possible :).

For the applications that I'm intending to write, just doing my own parser and
API is both more appealing and more rewarding.  Neither DOM nor SAX will
present an API which allows me to get network XML events in quite the way I
want, so I'm going to have to do some wrapping.  (I do wish pyRXP were
event-based... it's very close, in spirit, to what I want.)  If the general
quality of XML parsers in Python were really high, I would regard this impulse
as contrary and counterproductive -- why write my own library for doing this
when perfectly good ones already exist and are deployed all over the place?
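
To make "network XML events" a little more concrete, here is a minimal sketch
of the kind of incremental, push-style parsing I have in mind.  It uses the
expat binding in the Python standard library (xml.parsers.expat) purely as an
illustration; it is not the parser discussed above.  make_event_parser and the
event tuple layout are hypothetical, and the chunks stand in for whatever a
socket would actually deliver:

```python
import xml.parsers.expat


def make_event_parser(events):
    """Return an expat parser that appends event tuples to `events`."""
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # coalesce adjacent character-data callbacks
    parser.StartElementHandler = (
        lambda name, attrs: events.append(("start", name, attrs)))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda text: events.append(("text", text))
    return parser


events = []
parser = make_event_parser(events)

# Feed the document in arbitrary pieces, as a data-received callback would.
# Note that the first split falls in the middle of a tag.
for chunk in (b"<stream><mess", b"age to='glyph'>hi</message>", b"</stream>"):
    parser.Parse(chunk, False)

parser.Parse(b"", True)  # signal end of input

for event in events:
    print(event)
```

The point of the design is that the caller never needs the whole document in
hand: it pushes bytes as they arrive and reacts to the callbacks, which is
awkward to get from a DOM and only partly what SAX offers.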

So maybe I'm just rationalizing what I would have done anyway.  Nevertheless,
it is easier to write my own XML parser than to even properly report the bugs
that I have thus far discovered.

> No one gets paid to develop PyXML, but if you come our way a bit, we're quite 
> willing to help.

I appreciate that.  At some point I hope to have the time to run down every
last bug I've found and help PyXML to become very robust.  (I know that my
requirements are at least a little esoteric; I don't plan for Twisted to be a
general-purpose XML processing toolkit!)  Despite my various problems with it,
PyXML *is* what got me to see why XML might be worthwhile and kind of cool in
some circumstances.

For more information on my perception of XML, and why my requirements are as
stripped-down as they are, look at the presentation here:

    http://xmlsucks.org/but_you_have_to_use_it_anyway/

(Yes, it's a real URL, and it's not mine.)

-- 
 |    <`'>    |  Glyph Lefkowitz: Traveling Sorcerer   |
 |   < _/ >   |  Lead Developer,  the Twisted project  |
 |  < ___/ >  |      http://www.twistedmatrix.com      |
