Re: [DOC-SIG] What does this mean for Python?
1) SAX driver for xmllib
2) xmlproc uses SAX natively instead of using a driver, although it will probably need to add some things beyond SAX later
That gives us well-formedness-checking and a simple standardized event-based API.
Building on that I'd planned on making:
1) A simple ESIS outputter, for demo/testing purposes.
2) A grove builder, eventually with DOM support, although there are things I dislike about DOM.
3) A validator.
I also wanted to be able to have groves, validation or both.
What say ye, good people?
Ouch, my head hurts. Does anyone have a good reference (website, book, whatever) to recommend that covers all important aspects of XML and stuff like groves, validations, and all the related acronyms?
Cheers /F
_______________
DOC-SIG - SIG for the Python Documentation Project
send messages to: doc-sig@python.org
administrivia to: doc-sig-request@python.org
_______________
Okay, let's play acronym expansion. (BTW, with so much XML/SGML activity projected for the next few months, I think we really should have a mailing list or SIG. For now, I'll keep doc-sig in the loop, as I was instructed the last time we discussed this stuff.)
Fredrik Lundh wrote:
1) SAX driver for xmllib
SAX ("Simple API for XML") is an event-driven API for getting information out of SGML documents. http://www.microstar.com/XML/SAX/ It has all of the usual benefits of APIs. You can swap in your favourite (fastest, or most convenient) parser.
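To make the event-driven idea concrete, here is a minimal sketch using the `xml.sax` module from today's Python standard library, which is an assumption on my part and not the 1998 saxlib discussed in this thread. The `ElementCounter` class, the `count_elements` helper and the sample document are hypothetical names chosen for illustration. The parser calls the handler's methods as it meets each piece of markup, and the application never sees the raw text.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts how often each element name starts."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser once for every start tag it encounters.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(document: bytes) -> dict:
    """Parse the document and return a mapping of element name to count."""
    handler = ElementCounter()
    xml.sax.parseString(document, handler)
    return handler.counts


SAMPLE = b"<play><act><scene/><scene/></act><act><scene/></act></play>"
print(count_elements(SAMPLE))
```

Because the handler only depends on the SAX callbacks, the same class should work with any parser that has a SAX driver, which is the "swap in your favourite parser" benefit described above.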
2) xmlproc uses SAX natively instead of using a driver, although it will probably need to add some things beyond SAX later
xmlproc is Lars' software. When he says he "uses it natively" instead of "through a driver", I think he means that his software is not yet set up to drop in someone else's parser easily.
That gives us well-formedness-checking and a simple standardized event-based API.
Well-formedness-checking is simple syntactic checking. SAX is the simple, standardized event-based API.
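As a rough illustration of well-formedness checking as "simple syntactic checking", the sketch below runs a document through the standard library's `xml.sax` parser (an assumption; the thread's own parsers are xmllib and xmlproc) with a do-nothing handler and reports whether a parse error was raised. The helper name `is_well_formed` is hypothetical.

```python
import xml.sax


def is_well_formed(document: bytes) -> bool:
    """Return True if the parser accepts the document without a syntax error."""
    try:
        # The base ContentHandler ignores every event; we only care
        # whether the parser gets through the document.
        xml.sax.parseString(document, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


print(is_well_formed(b"<a><b/></a>"))   # properly nested
print(is_well_formed(b"<a><b></a>"))    # <b> is never closed
```

Note that this says nothing about validity: no document type definition is consulted, only the syntax is checked.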
Building on that I'd planned on making:
1) A simple ESIS outputter, for demo/testing purposes.
ESIS is a simple linearized format for the output of SGML documents where every element starts on a line, attributes are on their own lines and so forth. ESIS is not SGML. It's like a "pickle" of SGML. I would encourage Lars to use a newer XML linearization format: http://www.jclark.com/xml/canonxml.html
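A small ESIS-style outputter can be sketched on top of SAX as follows. This is an illustration under assumptions, not Lars's outputter: it uses the standard library's `xml.sax`, and the line prefixes (`A` for an attribute, `(` and `)` for element start and end, `-` for character data) follow the convention I recall from SGML tools such as nsgmls. Every attribute is reported as `CDATA` because no DTD is read. `EsisWriter` and `to_esis` are hypothetical names.

```python
import xml.sax


class EsisWriter(xml.sax.ContentHandler):
    """Hypothetical handler that turns SAX events into ESIS-like lines."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._pending = []  # character data may arrive in several chunks

    def _flush_text(self):
        if self._pending:
            text = "".join(self._pending)
            self._pending.clear()
            # Escape backslashes and newlines so each record stays on one line.
            escaped = text.replace("\\", "\\\\").replace("\n", "\\n")
            self.lines.append("-" + escaped)

    def startElement(self, name, attrs):
        self._flush_text()
        # Attribute lines come before the start line of their element.
        for attr_name in sorted(attrs.getNames()):
            self.lines.append(f"A{attr_name} CDATA {attrs.getValue(attr_name)}")
        self.lines.append("(" + name)

    def endElement(self, name):
        self._flush_text()
        self.lines.append(")" + name)

    def characters(self, content):
        self._pending.append(content)


def to_esis(document: bytes) -> str:
    writer = EsisWriter()
    xml.sax.parseString(document, writer)
    return "\n".join(writer.lines)


print(to_esis(b'<speech who="HAMLET">To be</speech>'))
```

Such line-oriented output is easy to diff, which is why it is handy for comparing what two parsers report for the same document.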
2) A grove builder, eventually with DOM support, although there are things I dislike about DOM.
A grove is an abstract model for the in-memory representation of SGML documents. The DOM ("Document Object Model") is a World Wide Web Consortium (W3C) API for accessing the contents of an SGML document. In other words, the grove represents the data model and the DOM is a particular API for providing access to it. So where SAX concentrates on generating *events* for stream-based handling of documents, the DOM is an API for explicitly traversing and navigating an in-memory tree.
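For contrast with the event-based style, here is a minimal sketch of tree navigation using `xml.dom.minidom` from the current Python standard library (an assumption; no Python DOM had been settled on at the time of this thread). The helper `element_paths` is a hypothetical name. The whole document is first loaded into memory, and the code then walks the tree explicitly.

```python
from xml.dom import minidom


def element_paths(document: str) -> list:
    """Return a slash-separated path for every element, in document order."""
    dom = minidom.parseString(document)
    paths = []

    def walk(node, prefix):
        # Only element nodes contribute a path; text nodes are skipped.
        if node.nodeType == node.ELEMENT_NODE:
            path = prefix + "/" + node.tagName
            paths.append(path)
            for child in node.childNodes:
                walk(child, path)

    walk(dom.documentElement, "")
    return paths


print(element_paths("<play><act><scene/></act><act/></play>"))
```

The trade-off is the usual one: the tree allows random access and repeated traversal, at the cost of holding the entire document in memory.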
3) A validator.
A validator reads the declarations in the document type definition and verifies that the document conforms to it.
Paul Prescod - http://itrc.uwaterloo.ca/~papresco
Our lives shall not be sweated from birth until life closes; Hearts starve as well as bodies; give us bread, but give us roses. - http://www.columbia.edu/~melissa/petronella/songs/bread-roses.html
* Paul Prescod
|
| Okay, let's play acronym expansion.
Fredrik: sorry about the headache. I sort of assumed that people were familiar with XML terminology, which was of course a mistake. Thank you for doing the expansion, Paul. :)
| xmlproc is Lars' software.
That's right. It assumes pretty much the same role as xmllib: parsing a raw XML document and providing hooks for applications that want to do something with the data.
| When he says he "uses it natively" instead of "through a driver", I
| think he means that his software is not yet set up to drop in
| someone else's parser easily.
Sorry, what I meant was that it doesn't use a SAX driver, but instead speaks SAX natively, so to speak. It looks like that approach will become cumbersome as xmlproc becomes more complete, so I may have to use another approach later.
| I would encourage Lars to use a newer XML linearization format:
|
| http://www.jclark.com/xml/canonxml.html
Thanks for that pointer, Paul! I'll add support for canonical XML output to saxlib since that looks like it can be very useful for testing parsers.
| So where SAX concentrates on generating *events* for stream-based
| handling of documents, the DOM is an API for explicitly traversing
| and navigating an in-memory tree.
It's worth noting here that one can build a DOM implementation using the information that comes out of the SAX API so that the DOM library is completely independent of whatever parser is used. This means that if we have a C XML parser and some Python ones that all have SAX drivers, the DOM library can use whichever of these happens to be available in each particular installation. Don Park has already made such a DOM implementation on top of SAX in Java, called SAXDOM[1].
I've now made a naive SAX driver for xmllib and added it to my web page[2] together with the ESIS outputter. It's not complete since I don't know how complete xmllib is, but once I add the canonical XML outputter I can test that easily. It's all extremely simple, but should provide a reasonable demonstration of the potential of SAX for now. I will try to improve this to comply more fully with the spec later.
With SAX support in both xmllib and my own incomplete xmlproc I was able to do some speed comparisons. For good measure I threw in James Clark's XP[3] parser written in Java (and written to be as fast as possible) and DataChannel's DXP[4] Java parser.
Here are the results on my 166 MHz Pentium:
Time to run hamlet.xml through validation and grove building via SAX:
Parser       1st    2nd    3rd    Avg
xmllib.py    50.1   48.4   49.8   49.4
xmlproc.py   40.8   39.4   39.5   39.9
xp.java      1.49   1.43   1.43   1.45
dxpcl.java   14     -      -      14
With no validation or grove building (empty document handler):
Parser       1st    2nd    3rd    Avg
xmllib.py    38.6   37.2   38.7   38.2
xmlproc.py   32.5   33     32     32.5
The numbers speak for themselves, I think. I'll have to read the XP sources closely to see whatever James Clark did to XP to make it that fast.
(The comparison between xmllib and xmlproc is not entirely fair since I've still got to add some stuff to xmlproc that will slow it down, but then I haven't tried optimizing it yet either.)
[1] <URL:http://users.quake.net/donpark/saxdom.html>
[2] <URL:http://www.stud.ifi.uio.no/~larsga/download/python/xml/>
[3] <URL:http://www.jclark.com/xml/xp/index.html>
[4] <URL:http://www.datachannel.com/products/xml/DXP/>
--
"These are, as I began, cumbersome ways / to kill a man. Simpler, direct, and much more neat / is to see that he is living somewhere in the middle / of the twentieth century, and leave him there." -- Edwin Brock
http://www.stud.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/
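The point made in this message, that a tree can be built purely from SAX events and so stay independent of the parser, can be sketched as below. This is not SAXDOM or a DOM implementation; `Node`, `TreeBuilder` and `build_tree` are hypothetical names for a much simpler structure, and the standard library's `xml.sax` stands in for whatever SAX driver is installed.

```python
import xml.sax


class Node:
    """Hypothetical minimal tree node: a name, attributes and children."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs
        self.children = []  # holds Node objects and text strings


class TreeBuilder(xml.sax.ContentHandler):
    """Builds a Node tree using nothing but SAX callbacks."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []  # the chain of currently open elements

    def startElement(self, name, attrs):
        node = Node(name, dict(attrs.items()))
        if self._stack:
            self._stack[-1].children.append(node)
        else:
            self.root = node
        self._stack.append(node)

    def endElement(self, name):
        self._stack.pop()

    def characters(self, content):
        if self._stack:
            self._stack[-1].children.append(content)


def build_tree(document: bytes) -> Node:
    builder = TreeBuilder()
    xml.sax.parseString(document, builder)
    return builder.root


root = build_tree(b'<a x="1"><b/>hi</a>')
print(root.name, root.attrs, [type(c).__name__ for c in root.children])
```

Since `TreeBuilder` touches only the handler interface, replacing the parser behind it should not require any change to the tree code.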
On Thu, Mar 12 1998 Lars Marius Garshol wrote:
With SAX support in both xmllib and my own incomplete xmlproc I was able to do some speed comparisons. For good measure I threw in James Clark's XP[3] parser written in Java (and written to be as fast as possible) and DataChannel's DXP[4] Java parser.
Here are the results on my 166 MHz Pentium:
Time to run hamlet.xml through validation and grove building via SAX:
Parser       1st    2nd    3rd    Avg
xmllib.py    50.1   48.4   49.8   49.4
xmlproc.py   40.8   39.4   39.5   39.9
xp.java      1.49   1.43   1.43   1.45
dxpcl.java   14     -      -      14
With no validation or grove building (empty document handler):
Parser       1st    2nd    3rd    Avg
xmllib.py    38.6   37.2   38.7   38.2
xmlproc.py   32.5   33     32     32.5
The numbers speak for themselves, I think. I'll have to read the XP sources closely to see whatever James Clark did to XP to make it that fast.
I have a question about the timings here. How was the data fed to the XML parser in xmllib.py? If you do
python xmllib.py hamlet.xml
the data is fed to the parser one character at a time. But it is also possible to feed everything at once. There are very significant performance differences between these two methods: if the XML parser sees that a tag is incomplete (usually after parsing the first part of the tag), it saves the data until you feed more data. This means that if you feed the data one character at a time, tags will be partially parsed many times before they are parsed completely, slowing down the process quite a bit.
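The two feeding styles described above can be sketched with an incremental parser. xmllib is no longer part of Python, so this example assumes the standard library's `xml.sax.make_parser()`, whose default expat-based parser offers `feed()` and `close()`; it shows only that the chunk size does not change the events delivered, and makes no claim about xmllib's reparsing cost. `EventRecorder` and `parse_in_chunks` are hypothetical names.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records start and end events in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def parse_in_chunks(document: bytes, chunk_size: int) -> list:
    """Feed the document to the parser chunk_size bytes at a time."""
    parser = xml.sax.make_parser()
    recorder = EventRecorder()
    parser.setContentHandler(recorder)
    for offset in range(0, len(document), chunk_size):
        parser.feed(document[offset:offset + chunk_size])
    parser.close()  # signals end of input
    return recorder.events


DOC = b"<play><act><scene>text</scene></act></play>"
one_at_a_time = parse_in_chunks(DOC, 1)
in_16k_blocks = parse_in_chunks(DOC, 16 * 1024)
print(one_at_a_time == in_16k_blocks)
```

Wrapping either call in a timer is a simple way to see how much the feeding strategy matters for a given parser.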
(The comparison between xmllib and xmlproc is not entirely fair since I've still got to add some stuff to xmlproc that will slow it down, but then I haven't tried optimizing it yet either.)
I haven't done any optimisations in xmllib either. One obvious optimization is to use regex instead of re (but I am not planning to do that).
-- Sjoerd Mullender <Sjoerd.Mullender@cwi.nl> <URL:http://www.cwi.nl/~sjoerd/>
At 12:38 13.03.98 +0100, Sjoerd Mullender wrote:
I have a question about the timings here. How was the data fed to the XML parser in xmllib.py? If you do python xmllib.py hamlet.xml the data is fed to the parser one character at a time.
I think I fed it to the parser in 16K blocks, but I don't actually remember how I did it. Anyway, I will add a timer application to saxlib, so that anyone can do their own speed testing and modify it as they wish. (I hope that satisfies you as well, Jack.)
I'll release that tonight (when I get home from work) together with a driver for David Scherer's XML-Toolkit (announced on comp.lang.python on Wednesday). Hopefully I'll be able to get xmlproc out some time during the weekend.
The really important issue here, I think, is standardizing the parser APIs. We now have Dan Connolly's XML scanner/parser, xmllib and David Scherer's parser, with at least two more coming up.
I'm still waiting for reactions to my SAX proposal. What do you people out there think? Does it look usable? Should we make it the standard Python API or should we scrap it? Or should we modify it? Should we change the method names to be more Python-like? And can it be used with JPython to interoperate with things like Don Park's SAXDOM? All comments/thoughts on this would be very welcome.
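A timer application of the kind mentioned here could look roughly like the following. This is a sketch under assumptions, not the saxlib timer: it uses the standard library's `xml.sax` and `time.perf_counter`, and `time_parse` is a hypothetical name. It parses the same document several times with a fresh handler and reports the individual timings and their average, mirroring the "1st 2nd 3rd Avg" layout of the tables earlier in the thread.

```python
import time
import xml.sax


def time_parse(document: bytes, handler_factory=xml.sax.ContentHandler, runs: int = 3):
    """Parse the document `runs` times; return (timings, average) in seconds."""
    timings = []
    for _ in range(runs):
        handler = handler_factory()  # fresh handler so runs do not share state
        start = time.perf_counter()
        xml.sax.parseString(document, handler)
        timings.append(time.perf_counter() - start)
    return timings, sum(timings) / len(timings)


if __name__ == "__main__":
    # A synthetic document stands in for hamlet.xml, which is not bundled here.
    sample = b"<play>" + b"<line>words words words</line>" * 10000 + b"</play>"
    runs, average = time_parse(sample)
    print(" ".join(f"{t:.3f}" for t in runs), f"avg {average:.3f}")
```

Passing a different `handler_factory` lets the same harness compare an empty document handler against one that does real work, such as building a tree.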
I haven't done any optimisations in xmllib either. One obvious optimization is to use regex instead of re (but I am not planning to do that).
I also use re and don't have any intention of changing, either.
Sjoerd, please don't feel threatened by my making my own parser. I did it partly for fun and partly to better understand the interplay between XML entities, well-formedness checking, validation, grove building and what actually goes to the application. So it was not because of dissatisfaction with xmllib, but because I wanted to understand these things better.
In fact, when I use xmllib with the SAX canonical XML outputter I seem to get the same results that James Clark's XP gives, so it looks as though xmllib pretty much follows the standard. (I haven't done any rigorous testing, just tested some features I was uncertain about.)
I've been telling my colleagues here at STEP Infotek (an SGML firm) about this Python/XML effort and at least two of them (who now use Java and Perl) reacted with "Hmmm... Maybe I should start using Python for my XML work." One of them has even printed out the Python tutorial already. So I think this can be very beneficial for Python if we do it right.
And I definitely agree with Sean McGrath: Python is infinitely better than Perl for this kind of thing. Having a healthy crop of XML parsers and tools written in Python would help make this clear to people.
In fact, I now have a list with links to free XML tools and it looks as though I should split the parser section into Java parsers, Python parsers and other parsers. IMHO, that's the kind of thing that would make an impression on people getting into XML and looking for tools.
Just my $0.02, of course.
--Lars M.
On Fri, Mar 13 1998 Lars Marius Garshol wrote:
At 12:38 13.03.98 +0100, Sjoerd Mullender wrote:
I have a question about the timings here. How was the data fed to the XML parser in xmllib.py? If you do python xmllib.py hamlet.xml the data is fed to the parser one character at a time.
I think I fed it to the parser in 16K blocks, but I don't actually remember how I did it.
16K blocks shouldn't give too much extra overhead because of the reparsing, so the figures should be pretty close to optimal for xmllib.
Sjoerd, please don't feel threatened by my making my own parser. I did it partly for fun and partly to better understand the interplay between XML entities, well-formedness checking, validation, grove building and what actually goes to the application. So it was not because of dissatisfaction with xmllib, but because I wanted to understand these things better.
I don't feel threatened. I was the first to create an XML parser for Python, and nobody can take that away. :-)
In fact, when I use xmllib with the SAX canonical XML outputter I seem to get the same results that James Clark's XP gives, so it looks as though xmllib pretty much follows the standard. (I haven't done any rigorous testing, just tested some features I was uncertain about.)
I looked hard at the XML spec when implementing it, so I feel pretty confident that it is reasonably close. I did some more work after 1.5 came out, so my current version is even better (though not necessarily faster).
-- Sjoerd Mullender <Sjoerd.Mullender@cwi.nl> <URL:http://www.cwi.nl/~sjoerd/>
At 13:26 13.03.98 +0100, Sjoerd Mullender wrote:
I don't feel threatened. I was the first to create an XML parser for Python, and nobody can take that away. :-)
True enough. xmllib is also part of the standard distribution, which is another point in your favour. :)
I looked hard at the XML spec when implementing it, so I feel pretty confident that it is reasonably close. I did some more work after 1.5 came out, so my current version is even better (though not necessarily faster).
Do you have a URL to it? It would be nice to both have the newest version and to be able to link to xmllib specifically and not just as part of the standard distribution.
--Lars M.
On Fri, Mar 13 1998 Lars Marius Garshol wrote:
At 13:26 13.03.98 +0100, Sjoerd Mullender wrote:
I don't feel threatened. I was the first to create an XML parser for Python, and nobody can take that away. :-)
True enough. xmllib is also part of the standard distribution, which is another point in your favour. :)
But that could be taken away from me. :-)
I looked hard at the XML spec when implementing it, so I feel pretty confident that it is reasonably close. I did some more work after 1.5 came out, so my current version is even better (though not necessarily faster).
Do you have a URL to it? It would be nice to both have the newest version and to be able to link to xmllib specifically and not just as part of the standard distribution.
ftp://ftp.cwi.nl/pub/sjoerd/xmllib.tar.gz
http://www.cwi.nl/ftp/sjoerd/xmllib.tar.gz
-- Sjoerd Mullender <Sjoerd.Mullender@cwi.nl> <URL:http://www.cwi.nl/~sjoerd/>
A new snapshot of ``PyDOM'' (DOM-light in Python) is available at: http://www.math.jussieu.fr/~fermigie/python/PyDOM/
I haven't received any comments so far.
Cheers, S.
participants (5)
- Fredrik Lundh
- Lars Marius Garshol
- Paul Prescod
- Sjoerd Mullender
- Stefane Fermigier