http://www.perl.com/perl-xml.html

----

How to Make Perl The Language of Choice for XML

Perl has been the language of choice for anyone doing serious text processing. Now efforts are underway to make Perl the language of choice for those doing "structured" text processing using the Extensible Markup Language (XML).

The XML 1.0 specification was recently (Feb. 10, 1998) released as a recommendation by the World Wide Web Consortium. XML is a subset of SGML (Standard Generalized Markup Language) and it seems to be emerging as a universal syntax for defining non-proprietary document markup and data formats. XML made significant changes to SGML to reflect the nature of the Web and to make it easier to build tools that process XML.

Tim Bray, co-editor of the XML 1.0 specification, has used Perl extensively for huge text processing applications. He had a special interest in seeing a bridge built from Perl to XML -- one that would make it simple for programmers to process XML data. So, out of this interest, a small group of developers met at O'Reilly & Associates in Sebastopol, California for a one-day Perl/XML summit. In addition to Tim, those attending the summit were:

Larry Wall, creator of Perl, and senior developer, O'Reilly & Associates
Dick Hardt, developer of Perl for Win32, and Chief Technology Officer, ActiveState Tool Corp.
Tim O'Reilly, President and CEO, O'Reilly & Associates
Dale Dougherty, CEO, Songline Studios
Gina Blaber, Director, Software Products Group, O'Reilly & Associates

"In the design of XML, we were continuously mindful of the need to enable the fast, efficient creation of scripts and programs for processing XML," says Tim Bray.

----

My commentary: Perl has nothing to recommend it over Python right now. In fact this February's Dr. Dobb's already has an article on Python and XML. The only snag is Unicode support. Perl doesn't have Unicode support but Larry has promised it.

"One of the summit group's first priorities is to get Perl working with Unicode (ISO 10646). Unicode enables code to be easily translated into other languages; XML requires Unicode. Larry Wall will lead the team working on this task."

Many people equate CGI and Perl. I would hate to see that happen with XML and hope I can help to stop that from happening. In the short term, I will integrate JPython with a Java XML parser and write a tutorial on how to use that (JPython inherits Unicode support from Java, right?).

Paul Prescod - http://itrc.uwaterloo.ca/~papresco

Can we afford to feed that army, while so many children are naked and hungry? Can we afford to remain passive, while that soldier-army is growing so massive? - "Gabby" Barbadian Calypsonian in "Boots"

_______________
DOC-SIG - SIG for the Python Documentation Project
send messages to: doc-sig@python.org
administrivia to: doc-sig-request@python.org
_______________
Paul Prescod writes:
How to Make Perl The Language of Choice for XML
Thanks for finding this, Paul. Now, how should we respond?

1) The String-SIG's been pretty dead lately; I've been posting the odd bugfix patch for the PCRE code, and that's about it. Can we please start considering a Unicode string type? This would kill two birds with one stone, since Unicode is important both for XML and for Mark Hammond's PythonWin.

2) The JPython idea is a good one.

3) What about XML support for CPython? I'd like to be able to do XML processing without requiring external programs such as SP or nsgmls. Writing an XML DTD parser, and after that a well-formedness verifier, has therefore been on my project list for a bit. I'll push it up in importance. Once we can parse DTDs, we could write an XML parser that created a tree (or grove, or whatever the precise terminology is) for a document. (A module that read SP's output would still be useful, of course.)

4) What else is there that could be done? Perhaps, if the attempt to convert the documentation to XML is begun, that large application will drive development of further XML tools. What seem like useful deliverables?

A.M. Kuchling http://starship.skyport.net/crew/amk/

Dream casts a human shadow, when it occurs to him to do so. -- From SANDMAN: "Season of Mists", episode 0
I think that perhaps the most useful thing that can be done to pry XML away from the Perl evangelists would be to work on making XSL a more attractive alternative. To do this, XSL probably needs to be seen as a general XML-to-XML transformation language (or Perl (PML?) will fill that niche). Then construct a version of XSL with embedded Python (for string manipulation and suchlike) and with Python objects which could be used for constructing XML structures. Finally, build a simple display engine and voila.... (I've actually been thinking about doing something like this myself, maybe trying to use Jade as a back end, but Jade is so complex...)

Though it seems so futile to resist the Perl-borg juggernaut. I had to fight hard to write CGI scripts in Python instead of Perl - after all, "Perl is the only language for CGI"

-- jeff putnam - jefu@knowledge2000.com - knowledge 2000
On the Python marketing side, we could actually use this as an opportunity to get some publicity (if that interests the powers that be). Media loves a horse race and if we promote comparisons with Perl, people will at least know that Perl isn't the only language in the class. "We have nothing to lose but our obscurity."

Andrew Kuchling wrote:
1) The String-SIG's been pretty dead lately; I've been posting the odd bugfix patch for the PCRE code, and that's about it. Can we please start considering a Unicode string type? This would kill two birds with one stone, since Unicode is important both for XML and for Mark Hammond's PythonWin.
And also more generally for being the best scripting language in the world :) (and not just in English speaking countries).
3) What about XML support for CPython? I'd like to be able to do XML processing without requiring external programs such as SP or nsgmls. Writing an XML DTD parser, and after that a well-formedness verifier, has therefore been on my project list for a bit. I'll push it up in importance. Once we can parse DTDs, we could write an XML parser that created a tree (or grove, or whatever the precise terminology is) for a document. (A module that read SP's output would still be useful, of course.)
I've written the latter (nsgmls output) module. I haven't done a lot of Python work since the Rise of XML, so I have nothing XML specific.

I would say that instead of writing an XML parser in Python (probably not fast enough), or writing one from scratch in C (a bunch of needless work), we should start with James Clark's XMLTok, which is written in ANSI C.

The "tree" you describe should probably be a W3C DOM[1]. That spec. isn't totally solid yet, but it is usually better to conform to a shifting standard than a completely proprietary API of your own designing. There should also be an event interface based on SAX[2].

[1] http://www.w3.org/TR/WD-DOM/
[2] http://www.microstar.com/XML/SAX/

I think that JPython gets most of this for "free" with a very little bit of glue. All I need to do is document how to use the glue.

Paul Prescod - http://itrc.uwaterloo.ca/~papresco

Can we afford to feed that army, while so many children are naked and hungry? Can we afford to remain passive, while that soldier-army is growing so massive? - "Gabby" Barbadian Calypsonian in "Boots"
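For later readers, the "tree" side of the tree-versus-event distinction drawn above can be sketched with xml.dom.minidom from the present-day Python standard library. That module did not exist when this message was written and is only one DOM implementation; the sample document and the helper name element_names are made up for illustration.

```python
from xml.dom import minidom

# Made-up sample document, used only for illustration.
SAMPLE = "<play><title>Hamlet</title><act><scene/><scene/></act></play>"


def element_names(node):
    """Return the names of all descendant elements in document order."""
    names = []
    for child in node.childNodes:
        # Skip text nodes and anything else that is not an element.
        if child.nodeType == child.ELEMENT_NODE:
            names.append(child.tagName)
            names.extend(element_names(child))
    return names


doc = minidom.parseString(SAMPLE)      # the whole document becomes a tree
root = doc.documentElement
names = [root.tagName] + element_names(root)
scene_count = len(root.getElementsByTagName("scene"))
print(names, scene_count)
```

The whole document is held in memory as a tree, which is convenient for navigation but costs more than an event interface that hands over one piece at a time.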
[This is the last message I'll be cross-posting to both the Doc-SIG and String-SIG. The Doc-SIG is "a forum for discussing both the form and content of Python documentation" (from Michael McLay's description) and not document processing in general, so I conclude that the String-SIG is more appropriate.]

Paul Prescod writes:
be). Media loves a horse race and if we promote comparisons with Perl, people will at least know that Perl isn't the only language in the class. "We have nothing to lose but our obscurity."
We also promote flame wars. It's better to stand alone, and to demonstrate how much simpler the job is in Python.
I would say that instead of writing an XML parser in Python (probably not fast enough), or writing one from scratch in C (a bunch of needless work), we should start with James Clark's XMLTok, which is written in ANSI C.
Fred Drake just said much the same thing to me, but I'm interested in pure Python processing for particular personal purposes. I'd like to be able to do XML processing on various machines, from my home machine to the ones at work to starship, preferably without having to install C extensions or external SGML parsers. Perhaps we can follow string/strop's lead, and provide a Python version, replacing it with a faster-but-compatible version if the C extension is available.
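The string/strop arrangement mentioned above amounts to trying the fast implementation first and falling back to a compatible one. A minimal sketch in present-day Python, assuming a hypothetical C extension name (_hypothetical_fast_xml) that is not a real module, with xml.sax merely standing in for the always-available fallback:

```python
import importlib


def load_first_available(module_names):
    """Import and return the first module in module_names that is installed."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed here; try the next candidate
    raise ImportError("none of %r could be imported" % (module_names,))


# "_hypothetical_fast_xml" is an invented name for an optional C extension.
# "xml.sax" is only a stand-in for the fallback that is always present.
backend = load_first_available(["_hypothetical_fast_xml", "xml.sax"])
print(backend.__name__)
```

Callers only ever use the returned module, so the two implementations must expose the same interface for the swap to be invisible.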
I think that JPython gets most of this for "free" with a very little bit of glue. All I need to do is document how to use the glue.
So there's one deliverable. Another deliverable: an XML-HOWTO which provides an overview of Python and XML processing. I'll happily work on that.

Sean McGrath wrote about XMLTok:
James has specifically designed it to be integrated into other applications. I do not think this would take very long and was hoping to have a shot at it myself:-( Volunteer C extension developers, please take one step forward.
So that's probably another deliverable: a C interface to XMLTok. I just received Lars Marius Garshol's message; that code is certainly going to be worth a look when it's released, and perhaps it can be made to use XMLTok if available.

The November issue of Linux Journal will be about Web programming languages, and they've already agreed to one Python article; another one about XML would probably interest them, too. I'll commit to that as well.

Deliverables:

* JPython glue and documentation
* XML HOWTO
* C interface to XMLTok
* Code to parse a document and return a grove.
* At least one magazine article about XML & Python.

A.M. Kuchling http://starship.skyport.net/crew/amk/

What a terrible thing to have lost one's mind. Or not to have a mind at all. How true that is. -- J. Danforth Quayle
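As a rough illustration of what a C-backed parser interface makes possible, here is a well-formedness check written against xml.parsers.expat from today's Python standard library. That module wraps expat, a later C parser also written by James Clark; it is not the XMLTok binding proposed in this thread, and the function name is_well_formed is invented for the example.

```python
import xml.parsers.expat


def is_well_formed(data):
    """Return True if data (bytes or str) parses as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


print(is_well_formed(b"<doc><item/></doc>"))
print(is_well_formed(b"<doc><item></doc>"))
```

This checks well-formedness only; it does not validate against a DTD.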
Paul Prescod wrote:
I will integrate JPython with a Java XML parser and write a tutorial on how to use that (JPython inherits Unicode support from Java, right?).
This sounds cool. JPython does inherit Unicode support from Java in the standard string objects. The string and re modules are also designed to handle Unicode strings. I should warn you that I haven't tested this functionality much at all (the curse of being an English speaker).

-Jim
Excerpts from ext.python: 11-Mar-98 [DOC-SIG] Re: [STRING-SIG] .. Jim Hugunin@CNRI.Reston. (626*)
JPython does inherit Unicode support from Java in the standard string objects.
Hmmm. In Python, strings are also used to represent arbitrary byte sequences. Will this feature interact unfortunately with the Unicode-based string support?

Bill
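The tension raised here is easiest to see in present-day Python 3, which keeps byte sequences (bytes) and Unicode text (str) as separate types. The sketch below is only an illustration of that later design, not of how JPython behaved at the time; the sample values are arbitrary.

```python
# Arbitrary binary data: not every byte sequence is valid text.
raw = bytes([0x89, 0x50, 0x4E, 0x47])

# Unicode text: five characters, one of them outside ASCII.
text = "na\u00efve"

# Converting between the two always goes through an explicit encoding.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

try:
    raw.decode("utf-8")
    raw_is_text = True
except UnicodeDecodeError:
    # 0x89 cannot start a UTF-8 sequence, so this branch is taken.
    raw_is_text = False

print(len(text), len(encoded), raw_is_text)
```

With a single string type doing both jobs, the character count and the byte count of the same value can no longer be assumed equal, which is the kind of interaction the question is asking about.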
* Paul Prescod
|
| Many people equate CGI and Perl. I would hate to see that happen
| with XML and hope I can help to stop that from happening. In the
| short term, I will integrate JPython with a Java XML parser and
| write a tutorial on how to use that

For the last week I've been working on a validating XML parser in pure Python that builds a grove-like tree. Initially, I built on xmllib and added a DTD parser, a catalog file handler and some grove objects. I'm now making my own replacement for xmllib (xmlproc), which I'm planning to integrate with the existing validator/grove builder.

At this stage I'm able to produce basic ESIS output from xmlproc and the validator/tree builder is advanced enough to parse Tim Bray's plays and religious texts. In fact I wrote a small script that went through the tree and counted the number of speeches and lines for each character in a play.[1] The downside is that it takes 50 seconds to parse Hamlet (275 k) on my Win95 Pentium 166MHz, which is much too slow.

This is how I envisioned my XML package:

1) SAX driver for xmllib
2) xmlproc uses SAX natively instead of using a driver, although it will probably need to add some things beyond SAX later

That gives us well-formedness-checking and a simple standardized event-based API. Building on that I'd planned on making:

1) A simple ESIS outputter, for demo/testing purposes.
2) A grove builder, eventually with DOM support, although there are things I dislike about DOM.
3) A validator.

I also wanted to be able to have groves, validation or both.

The main catches I see here are:

1) Lack of Unicode support
2) Lack of speed

IMHO the solution to the speed is to do the xmllib/xmlproc part in C, possibly via XMLTok, like Paul suggested. I think we should have a Python version of this as well, and thanks to SAX, we can have our cake and eat it too.

Given the reaction from people to this Perl thing I'm uncertain as to what I should do. Perhaps I should rush out a minimal package consisting of a SAX shell, an ESIS outputter building on it and a SAX driver for xmllib? That would give any C volunteers something to build towards and those who want to deal with the grove/validation part something to build from.

What say ye, good people?

[1] Hamlet has 359 speeches and 1459 lines, more than three times what any other character in Hamlet/Tempest/Romeo&Juliet has. Kenneth Branagh must have a photographic memory. :)

-- 
"These are, as I began, cumbersome ways / to kill a man. Simpler, direct, and much more neat / is to see that he is living somewhere in the middle / of the twentieth century, and leave him there." -- Edwin Brock
http://www.stud.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/
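The speech-counting script described in this message worked on a tree. The same count can also be done directly on an event-based API, as in this rough sketch using xml.sax from today's Python standard library. It is not the xmlproc code discussed here, and the element names SPEECH, SPEAKER and LINE are an assumption about how the play files are marked up.

```python
import xml.sax
from collections import Counter


class SpeechCounter(xml.sax.ContentHandler):
    """Count speeches and lines per speaker from SAX events."""

    def __init__(self):
        super().__init__()
        self.speeches = Counter()
        self.lines = Counter()
        self._speakers = []       # speakers of the speech being read
        self._in_speaker = False
        self._text = []

    def startElement(self, name, attrs):
        if name == "SPEECH":
            self._speakers = []
        elif name == "SPEAKER":
            self._in_speaker = True
            self._text = []
        elif name == "LINE":
            # Credit the line to every speaker of the current speech.
            for speaker in self._speakers:
                self.lines[speaker] += 1

    def characters(self, content):
        # Text may arrive in several chunks, so collect until the end tag.
        if self._in_speaker:
            self._text.append(content)

    def endElement(self, name):
        if name == "SPEAKER":
            speaker = "".join(self._text).strip()
            self._speakers.append(speaker)
            self.speeches[speaker] += 1
            self._in_speaker = False


def count_speeches(xml_bytes):
    """Parse xml_bytes and return (speeches, lines) Counters."""
    handler = SpeechCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.speeches, handler.lines
```

Because the handler keeps only a few counters, memory use stays flat regardless of the size of the play, which is the main attraction of the event interface over a grove when only a summary is needed.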
Lars Marius Garshol writes:
Given the reaction from people to this Perl thing I'm uncertain as to what I should do. Perhaps I should rush out a minimal package consisting of a SAX shell, an ESIS outputter building on it and a SAX driver for xmllib? That would give any C volunteers something to build towards and those who want to deal with the grove/validation part something to build from.
I'm willing to do the xmltok C module. It will be a week or so before I can get to it; I really need to get the Python documentation source distribution finished up.

-Fred

--
Fred L. Drake, Jr. fdrake@cnri.reston.va.us
Corporation for National Research Initiatives
1895 Preston White Drive
Reston, VA 20191
participants (7): Andrew Kuchling, Bill Janssen, Fred L. Drake, Jefu!, Jim Hugunin, Lars Marius Garshol, Paul Prescod