Re: [XML-SIG] Re: [Twisted-Python] Can anyone recommend a sensible XML parser for Python?

I can't imagine why this would be gross. IMO, it illustrates a very admirable technique, and one of the strengths of Python/XML. The parsing mechanism and the generated representation are independent of each other, so you can mix them and match them in order to take advantage of the most needed features of either. We put a lot of work into making this possible, and I find it very elegant. C++ folks took ages before they cottoned on to such an approach (in the STL), and now it has them in raptures (generic programming is all the rage). Of course, old strait-jacket Java can't touch this. Too bad for them.
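As a rough illustration of mixing one parsing mechanism with another representation, here is a minimal sketch that drives a minidom tree from HTML tokenizer events. It assumes the current standard library (`html.parser` and `xml.dom.minidom`) rather than the Python 2.x HtmlParser and PyXML discussed in this thread; the `MiniDOMBuilder` class, the `html-fragment` wrapper element, and the short `VOID_TAGS` list are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser
from xml.dom.minidom import getDOMImplementation

# Elements HTML allows without a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class MiniDOMBuilder(HTMLParser):
    """Take events from the HTML tokenizer and build a minidom tree."""

    def __init__(self):
        super().__init__()
        # Hypothetical wrapper element so loose fragments have a single root.
        self.document = getDOMImplementation().createDocument(
            None, "html-fragment", None)
        self._stack = [self.document.documentElement]

    def handle_starttag(self, tag, attrs):
        element = self.document.createElement(tag)
        for name, value in attrs:
            element.setAttribute(name, value or "")
        self._stack[-1].appendChild(element)
        if tag not in VOID_TAGS:
            self._stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].tagName == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        if data.strip():
            self._stack[-1].appendChild(self.document.createTextNode(data))


def html_to_minidom(text):
    builder = MiniDOMBuilder()
    builder.feed(text)
    builder.close()
    return builder.document
```

Feeding it sloppy markup such as `<p>Hello<br>world <b>bold</p>` yields a tree in which the unclosed `b` element is closed implicitly when `p` ends. Real-world HTML needs far more recovery rules than this.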
You see, this is why reporting such "bugs" early is helpful. I could have told you ages ago that it is a *bad* idea to call cloneNode on a Document object. According to the DOM Level 2 spec: "And, cloning Document, DocumentType, Entity, and Notation nodes is implementation dependent." IOW, yer gets what yer gets and can't really complain :-) Can you expand a bit more on the actual use case that makes you think you want to clone a document node? I do agree that the confused error message is a glitch. Current PyXML CVS gives a more straightforward "sod off" :-)
    line 198, in cloneNode
      clone = _clone_node(self, deep, self.ownerDocument or self)
    File "/home/uogbuji/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 1454, in _clone_node
      raise Exception("Cannot clone node %s" % repr(node))
    Exception: Cannot clone node <xml.dom.minidom.Document instance at 0x82e9cbc>
We choose not to allow it. Perfectly legal, and I think this is the right choice.
You mean you can't require, say PyXML 0.8.1? Tough crowd you develop for? :-)
or produce what amounts to my own `implementation' of an XML parser.
If you try going this route, I guarantee you'll still be trying to get the most basic things right six months from now.
Here you have a point. Python, PyXML, and a lot of the related packages move very quickly, so quickly that they cause all manner of packaging problems. There is no easy solution to this. Python is much more of a volunteer community than, say, Java. People work on Python and PyXML mostly to scratch their itches, which means they have less incentive to worry about the packaging mess they leave behind. This is the impetus for the Python-in-a-tie effort for Python proper. I do think we'd make a lot more friends if there were a matching PyXML-in-a-tie. It would mean companies would have to commit scarce resources to freezing interfaces and then testing and packaging to oblivion. I have had it in mind to suggest a PyXML-in-a-tie type effort in the Python Business Forum once the effort on Python itself starts to gain legs. I guess I can count on you to at least help cheerlead? :-)
For the applications that I'm intending to write, just doing my own parser and API is both more appealing and more rewarding.
Really? Color me deeply skeptical. I have not seen an application on earth where implementing one's own parser is a good idea, and precious few where implementing one's own API is a good idea. I have a lot of colleagues who have tried. By all means, if you'd like to try, go ahead.
I have learned through my own bitter experience that you do not want network interfaces to have *anything* to do with the lexical XML layer (or even Infoset). It is best to design network interactions around *application* level semantics. Basically sending around chunks of XML text is far less hazardous than what I think you mean.
Well, as I said, I don't see any evidence that the quality of XML parsers in Python is not high. You pointed out one problem in cloneNode which, from what I gather, was mostly because you're abusing DOM. This had nothing to do with parsing. Are you speaking generically?
I find this claim ludicrous on its face. Writing an XML parser with the compliance level and quality of any of the ones in PyXML takes years. Yes. Years. Feel free to re-learn this fact the hard way, if you wish.
Yes. I'd guess we've all seen that link. <shrug> So what useful technology doesn't suck? XML works for me. Your mileage may vary.
--
Uche Ogbuji
Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Track chair, XML/Web Services One Boston: http://www.xmlconference.com/
Basic XML and RDF techniques for knowledge management, Part 7 - http://www-106.ibm.com/developerworks/xml/library/x-think12.html
Keeping pace with James Clark - http://www-106.ibm.com/developerworks/xml/library/x-jclark.html
Python and XML development using 4Suite, Part 3: 4RDF - http://www-105.ibm.com/developerworks/education.nsf/xml-onlinecourse-bytitle... 1EA5A2CF4621C386256BBB006F4CEC

Uche Ogbuji writes:
That's no reason to think it's a bad idea to implement it or need it, just that you can't rely on it being supported by an arbitrary DOM implementation.
I do agree that the confused error message is a glitch. Current PyXML CVS gives a more straightforward "sod off" :-)
Not quite; the previous message would have been raised calling cloneNode() on a processing instruction as well. Or calling it with deep=1 on a portion of the tree that contained a processing instruction. That was a real bug, and not an arbitrary limitation.
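The case described above is small enough to pin down as a regression check. The following is only a sketch against the standard library's `xml.dom.minidom` as it exists in current Python, not against the PyXML code under discussion; the function name and the sample document are made up for illustration.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def clone_with_processing_instruction():
    """Deep-clone a subtree that contains a processing instruction."""
    doc = parseString("<root><?render mode='fast'?><child/></root>")
    original = doc.documentElement
    clone = original.cloneNode(True)  # deep=True walks over the PI node
    return original, clone


original, clone = clone_with_processing_instruction()
pi = clone.firstChild
print(pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE, pi.target, pi.data)
```

A check like this belongs in whatever suite runs by default, so that a missing node-type constant shows up before a release rather than after.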
We choose not to allow it. Perfectly legal, and I think this is the right choice.
Honestly, I think we should implement cloneNode() for Document, simply because not doing so seems an unnecessary limitation. It is not for the library to decide what is right for the application. I agree that not supporting it is legal. The exception that is raised is wrong: it should be xml.dom.NotSupportedErr.
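For callers, the practical difference is that `xml.dom.NotSupportedErr` can be caught specifically, where a bare `Exception` cannot. A minimal sketch of what that allows, assuming the standard library's `xml.dom.minidom`; the re-parse fallback is an assumption made for this example, not something proposed in the thread.

```python
import xml.dom
from xml.dom.minidom import parseString


def clone_document(doc):
    """Return an independent deep copy of a minidom Document."""
    try:
        clone = doc.cloneNode(True)
    except xml.dom.NotSupportedErr:
        clone = None  # this DOM declines to clone Document nodes
    if clone is None:
        # Assumed fallback: round-trip through the serialized form.
        clone = parseString(doc.toxml())
    return clone
```

The round trip costs a full serialize and parse, so it only makes sense as a last resort, but it works with any DOM that can write itself out.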
If you try going this route, I guarantee you'll still be trying to get the most basic things right six months from now.
Heck, we're still trying to get Expat right, and it isn't exactly the freshest software around!
That would be nice to have. First task: improve & integrate all the random piles of tests out there! They should all be run when I type "make check" at the top level, not just a handful.
You pointed out one problem in cloneNode which, from what I gather, was mostly because you're abusing DOM. This had nothing to do with parsing.
It is not at all clear that this is an abuse of the DOM, as I explained above.
-Fred
--
Fred L. Drake, Jr. <fdrake at acm.org>
PythonLabs at Zope Corporation

On Sat, 07 Sep 2002 00:10:51 -0600, Uche Ogbuji <uche.ogbuji@fourthought.com> wrote:
On Fri, 06 Sep 2002 17:31:51 -0600, Uche Ogbuji <uche.ogbuji@fourthought.com> wrote:
I suppose I could try to wrap HtmlParser with minidom... yuck. Gross, but probably a good idea, come to think of it :)
I can't imagine why this would be gross.
Sorry, I was saying that making sense of non-XHTML HTML is kind of gross. I did say that it was a good idea, and it's definitely a neat trick.
According to the DOM Level 2 spec: "And, cloning Document, DocumentType, Entity, and Notation nodes is implementation dependent."
This is why standards compliance is not terribly important to me. I would rather have a useful XML API than a standardized one.
Can you expand a bit more on the actual use case that makes you think you want to clone a document node?
I have a template "frame" document. I want to clone the document, populate it with information lifted from other XML files, and then write the resultant (cloned) document out. This is the very first use-case I ever had working with XML and it is still the most common.
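The workflow described here does not strictly require cloneNode() on the Document itself. The sketch below builds a fresh Document and imports the root element into it, using `importNode()` from the standard library's `xml.dom.minidom`. The `TEMPLATE` string, the `title` and `body` element names, and both helper functions are hypothetical, and anything outside the root element (doctype, top-level comments) is not carried over.

```python
import io
from xml.dom.minidom import getDOMImplementation, parseString

TEMPLATE = "<page><title/><body/></page>"  # hypothetical "frame" document


def copy_document(doc):
    """Build a fresh Document and import the root element into it."""
    fresh = getDOMImplementation().createDocument(None, None, None)
    fresh.appendChild(fresh.importNode(doc.documentElement, True))
    return fresh


def fill_template(template_doc, title, source_xml):
    page = copy_document(template_doc)
    title_node = page.getElementsByTagName("title")[0]
    title_node.appendChild(page.createTextNode(title))
    body = page.getElementsByTagName("body")[0]
    source = parseString(source_xml)
    for node in source.documentElement.childNodes:
        # importNode copies across documents; the source stays intact.
        body.appendChild(page.importNode(node, True))
    return page


template = parseString(TEMPLATE)
page = fill_template(template, "Hi", "<src><p>one</p><p>two</p></src>")
out = io.StringIO()
page.writexml(out)  # write the populated copy; the template is unchanged
```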
We choose not to allow it. Perfectly legal, and I think this is the right choice.
Yes, but the point remains that this *used* to work, and now it *doesn't*. This is functionality I found useful. While I can't comment on the intrinsic sense or nonsense of cloning document nodes in DOM, I do know that it's difficult to keep track of when features like this appear and disappear in the various different XML solutions for Python. Maybe this is the only feature that has done this; I don't know. It just happens that it's a very commonly-used one for me. This is just another instance of my general complaint that tracking versioning dependencies is not worth the effort for my degenerately simple use-cases for XML.
You mean you can't require, say PyXML 0.8.1? Tough crowd you develop for? :-)
There are still some parties interested in Twisted who are upset that it requires Python 2.1; in fact, I felt guilty doing 2.1 support because I am likely going to have to backport portions of it to 1.5.2 for some people. We can all thank Red Hat for this inane persistence of ancient Python versions, but it is sadly the world I live in.
My main frustration is with packaging.
This is my main point, and this is the one that the PyXML community can do the least to address. Buggy and idiosyncratic implementations are already in the wild, and some apps will depend on those particular bugs and idiosyncrasies. If Twisted depends on a new or different set of bugs and quirks, I make it incompatible with whatever other XML-using applications are out there today. Given that XML is an integration technology this is certainly less than desirable.
There is no easy solution to this.
Having a project that is precipitously approaching 1.0 myself, I can sympathize. As much as this sort of dependency and compatibility problem has bothered me, I *know* there will be people that write apps for Twisted and will curse my name when I enhance some functionality later on :-).
Cheerleading, certainly :-). Although I'm less interested in seeing PyXML prepared for "business" clients and more interested in just seeing the level of QA on the volunteer work go up. If I *had* any spare "scarce resources" to commit beyond my own projects, I would certainly help getting the unit tests unified and automated.
...
While it is *possible* that I'm smarter than you think I am, it is certain that I'm more stubborn. My sophomoric attempt at an XML parser is now in Twisted CVS. I've had this objection raised over writing yet another web server, yet another remote procedure call protocol, yet another asynchronous socket server and yet another database interface. It seems like at least some of these ideas were good ones, so I went ahead and wrote an XML parser and representation anyway :-). A fellow I know from IRC once said "it's easier to write an s-expression parser for a particular platform by hand than to learn to use any of the XML tools for that platform". I think that if you're interested in keeping your focus narrow in terms of what you do with XML, the same is true of writing an XML parser. As a data point for this hypothesis, writing the parser and the node tree took me less than half as much time as writing these posts to various mailing lists about XML tools (not counting this post, which has been the most time-consuming); it took less than a quarter as much time as attempting (and failing) to track down bugs in PyXML, not counting the time I spent trying to figure out how to turn off undesired features in a way that would work on more than one version. My two main existing PyXML-using applications are already ported to this, changing barely any of their code. Even so, this is almost not a fair comparison because I have several months of experience with those tools on Python 2.1, and I've read a few books on XML already.
Neither DOM nor SAX will present an API which allows me to get network XML events in quite the way I want, so I'm going to have to do some wrapping.
I'm not sure what you think I mean, really, but specifically, I'm thinking particularly of parsing and routing Jabber XML streams. If they are designed in a "hazardous" way then it's not my issue... I don't think much of their protocol design as it is, especially with regard to routing. (As you might guess, I think the whole idea of using XML as a network protocol is rather strange; but Jabber in particular could have been much better done. BEEP, for example, I consider odd, but not broken.)
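To make "network XML events" concrete: a Jabber-style stream is one long-lived root element whose direct children arrive piecemeal over the socket. The sketch below reports each such child as it completes, using incremental parsing in the standard library's `xml.parsers.expat`. The `StanzaSplitter` class and its callback signature are assumptions made for illustration; this is not the parser that went into Twisted CVS, and it flattens any nested text into a single string.

```python
import xml.parsers.expat


class StanzaSplitter:
    """Report each depth-1 child of a long-lived stream root as it completes."""

    def __init__(self, on_stanza):
        self._on_stanza = on_stanza
        self._depth = 0
        self._current = None  # (name, attrs, text parts) of the open stanza
        self._parser = xml.parsers.expat.ParserCreate()
        self._parser.StartElementHandler = self._start
        self._parser.EndElementHandler = self._end
        self._parser.CharacterDataHandler = self._text

    def feed(self, data):
        # False: this is not the final chunk, more data may follow.
        self._parser.Parse(data, False)

    def _start(self, name, attrs):
        self._depth += 1
        if self._depth == 2:
            self._current = (name, attrs, [])

    def _text(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def _end(self, name):
        if self._depth == 2:
            tag, attrs, parts = self._current
            self._current = None
            self._on_stanza(tag, attrs, "".join(parts))
        self._depth -= 1
```

A caller would hand `feed()` whatever bytes or text the transport delivers; chunk boundaries can fall anywhere, including in the middle of character data.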
When I run my particular XML-munging tool, sometimes I get:
    NameError: global name 'PROCESSING_INSTRUCTION_NODE' is not defined
which we have discussed the reasons for here. Slightly less often, but still with a significant frequency (same python, same PyXML, same input), I get:
    zsh: segmentation fault ] (doc/howto/basics)
I can't present hard evidence for this, I'm sorry, because I'm not familiar with the internals of PyXML or expat and I can't get the bug to happen reliably. If I can ever boil it down to something predictable (i.e. less than 1500 lines of code and half a meg of XML to trigger it) be assured I will make the most complete bug report I can.
Nevertheless, it is easier to write my own XML parser than to even properly report the bugs that I have thus far discovered.
I never claimed to need a parser with PyXML's level of compliance; in fact, I've said several times that compliance at that level is annoying to me because it's too strict. I think we're going to have to agree to disagree on "quality", but at least for my use cases I don't get occasional coredumps from my parser. I cannot substantiate this with real bug reports, so please feel free to dismiss this as FUD if you disagree. From my discussions with other developers near my interest area, however, QA on the PyXML project is notoriously poor, and the quality is wildly variant from release to release. As you yourself have said, this is likely to remain so until someone funds improvements. I do not feel as though I am owed anything in particular by the PyXML project or by any subscriber to any of these lists. In fact, I'm quite grateful for it having provided a nice, simple introduction to the world of XML; I probably would not be using XML today at all if it weren't for the PyXML project. Unfortunately, due to my larger-than-average concerns about dependencies and ease of automating testing for my own project, I don't think that PyXML is a good solution. I need a *very* small XML library, with no strings attached. PyXML is huge, and featureful, and I'm sure in the most recent incarnations it's very robust. It does come with a lot of strings attached though. I have decided it's not worth my time at this point to invest a lot of effort in helping out, until a few versions go by and the general impressions I get from XML developers I work with are becoming more positive. This doesn't mean I won't lend a helping hand when I can, but the communication overhead to working in the PyXML community is not currently worth the gain I would get from it. I wish you the best of luck in making me look foolish for saying that :-).
--
Glyph Lefkowitz: Traveling Sorcerer
Lead Developer, the Twisted project
http://www.twistedmatrix.com

uche wrote:
on the other hand, virtually every commercial XML python user I know of uses their own non-pydom parser/sax-style api/dom-style api (with 4thought being the obvious exception, of course). if I couldn't use ElementTree-like apis, I'd probably give up XML programming... (using element trees, Glyph's use case would look something like:
    tree = copy.deepcopy(template_tree)
    for node in tree.findall(pattern):
        expand(context, node)
    tree.write(stream)
)
</F>
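A self-contained version of that outline, as a sketch only: it assumes `xml.etree.ElementTree` from the current standard library, and the `expand()` body, the `slot` element, and its `name` attribute are hypothetical stand-ins, since the thread does not say what expansion should do.

```python
import copy
import io
import xml.etree.ElementTree as ET


def expand(context, node):
    """Hypothetical stand-in for expand(): fill a slot from a dict."""
    node.text = context.get(node.get("name"), "")


def render(template_tree, context, pattern=".//slot"):
    tree = copy.deepcopy(template_tree)  # the template itself stays untouched
    for node in tree.findall(pattern):   # findall() returns every match
        expand(context, node)
    stream = io.StringIO()
    tree.write(stream, encoding="unicode")
    return stream.getvalue()


template = ET.ElementTree(ET.fromstring('<page><slot name="title"/></page>'))
print(render(template, {"title": "Hello"}))
```

Because the copy is taken with `copy.deepcopy()`, the same template tree can be rendered repeatedly with different contexts.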

participants (4)
- Fred L. Drake, Jr.
- Fredrik Lundh
- Glyph Lefkowitz
- Uche Ogbuji