[XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python?

Fri, 06 Sep 2002 17:31:51 -0600

> Hmm, I know that minidom has had some problems recently, but it has also
> seen some good improvements. It sounds like you need more robust DOM
> support--have you tried 4DOM? It's not as fast, but it does adhere to
> the spec the best.

cDomlette's cloneNode does work.  If minidom's doesn't, a bug report would be 
nice.

> Maybe (when you have time) if you let us know what
> you expect to accomplish we can help out--the people in XML-SIG are some
> of the sharpest in the community. Perhaps TREX or RELAX-NG would be more
> suitable. I guess the only comforting thing I can say is that every
> development community is experiencing growing pains when it comes to an
> XML strategy.
> 
> Good luck,
> 
> Eron
> 
> On Fri, 2002-09-06 at 02:39, Glyph Lefkowitz wrote:
> > 
> > On 05 Sep 2002 17:53:09 -0400, Eron Lloyd <elloyd@lancaster.lib.pa.us> wrote:
> > > Are you referring to PyXML? I know xml.* in the Standard Library is 
> > > pretty weak by far (but getting better!).
> > 
> > Yes.  In fact, PyXML is a big part of the problem.  Its "minidom" module, for
> > example, is *far* buggier than the one found in the standard library.  (As an
> > example of that, try to figure out how to make cloneNode work on a Document
> > object.)

What version of PyXML?

> > I could deal with one set of potential problems and pitfalls using XML in
> > Python and work around them, but I have to work around every combination of
> > versions to make a useful app that doesn't have very stringent installation
> > requirements: in practice this means 4 environments: python2.1 with pyxml,
> > python2.1 standalone, python2.2 with pyxml, python2.2 standalone.
> > 
> > I don't want a plethora of XML parsers with rich features, all of which are
> > broken.  I want *one* XML parser that can *reliably* transform a stream of
> > bytes into a stream of nodes, and a text file into a tree of nodes. 

You haven't given any evidence to the effect that PyXML does not have this.  A 
bug in cloneNode has nothing to do with parsing.
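
To make the two parsing modes under discussion concrete (bytes to a stream of 
nodes, text to a tree of nodes), here is a minimal sketch.  It assumes a 
current Python 3 standard library (xml.dom.pulldom and xml.dom.minidom), not 
the 2002-era PyXML or python2.1/2.2 modules discussed in this thread, and the 
function names are hypothetical, chosen only for illustration.

```python
from xml.dom import minidom, pulldom


def stream_element_names(xml_text):
    """Yield element names one at a time, as the parser reaches each start tag.

    Hypothetical helper: pulldom hands back (event, node) pairs, so the
    document is consumed as a stream rather than built up front.
    """
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            yield node.tagName


def tree_root_name(xml_text):
    """Parse the whole text into a DOM tree and return the root element name.

    Hypothetical helper: minidom builds the complete tree in memory.
    """
    doc = minidom.parseString(xml_text)
    return doc.documentElement.tagName


if __name__ == "__main__":
    sample = "<a><b/>hi</a>"
    print(list(stream_element_names(sample)))
    print(tree_root_name(sample))
```

Neither call validates against a DTD; both only require well-formed input.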

> > You
> > mentioned validation in your post and I explicitly said that validation is
> > worse than useless to me; in most cases I want to parse XHTML, which means
> > dealing with lots of potentially DTD-violating stuff which is still "valid" as
> > far as I'm concerned.

Doesn't HtmlParser do the trick?  If not, you could try 
dom.ext.readers.HtmlReader with a minidom implementation used to override the 
default.
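
As a rough illustration of tolerant, non-validating HTML parsing, the sketch 
below uses html.parser.HTMLParser from a current Python 3 standard library.  
That is an assumption made for the example: it is not necessarily the 
HtmlParser meant above, and it is not PyXML's dom.ext.readers.HtmlReader.  The 
class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and text, with no DTD checks."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, whether or not it is ever closed.
        self.start_tags.append(tag)

    def handle_data(self, data):
        # Keep only non-blank text runs.
        if data.strip():
            self.text_chunks.append(data.strip())


def collect(markup):
    """Feed markup to the parser and return (start tags, text chunks)."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags, parser.text_chunks


if __name__ == "__main__":
    # Unclosed <p> and <br> would not be well-formed XML, but parse here.
    print(collect("<p>one<br><p>two"))
```

This yields callbacks rather than a node tree; building a tree from them would 
be a separate step.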

BTW, from what you're describing above, you are *not* parsing XHTML.  If it 
violates the DTD, it is not XHTML.  Period.

Just say you're parsing "HTML" and don't mention a version.  That's the only 
way to say it correctly  ;-)

> > Eventually I'll clean up the problem cases I'm having and submit them as bug
> > reports, but right now it's not worth my time, because I really don't want to
> > deal with the fragility of the PyXML or python-standard-library xml.* stuff.

Well, no one can tell you what to do with your time, but such general comments 
are not very useful.  It's not as if you posted 10 bug reports, then threw up 
your hands and said "I'm blowing this joint".  You made one vague mention of a 
cloneNode bug, without even a bare test case.
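
For illustration only, a bare test case of the kind asked for here might look 
like the sketch below.  It assumes the xml.dom.minidom module from a current 
Python 3 standard library; how the PyXML minidom of 2002 behaved on the same 
input is not established in this thread.  The function name is hypothetical.

```python
from xml.dom import minidom


def clone_document(xml_text):
    """Parse xml_text, deep-clone the Document node, return (original, clone).

    Hypothetical helper for a bug report: it exercises exactly one call,
    Document.cloneNode(True), on a small fixed input.
    """
    doc = minidom.parseString(xml_text)
    clone = doc.cloneNode(True)  # deep clone of the whole document
    return doc, clone


if __name__ == "__main__":
    doc, clone = clone_document("<root><child attr='1'>text</child></root>")
    print("clone is a distinct object:", clone is not doc)
    print("root element name:", clone.documentElement.tagName)
    print("serializations match:", clone.toxml() == doc.toxml())
```

A report built around something this small, plus the PyXML and Python version 
numbers, gives a maintainer something to reproduce.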

No one gets paid to develop PyXML, but if you come our way a bit, we're quite 
willing to help.

-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Track chair, XML/Web Services One Boston: http://www.xmlconference.com/
Basic XML and RDF techniques for knowledge management, Part 7 - 
http://www-106.ibm.com/developerworks/xml/library/x-think12.html
Keeping pace with James Clark - 
http://www-106.ibm.com/developerworks/xml/library/x-jclark.html
Python and XML development using 4Suite, Part 3: 4RDF - 
http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle/8A1EA5A2CF4621C386256BBB006F4CEC