[Python-Dev] Expat -- it keeps going, and going, and going ...

Paul Prescod paul@prescod.net
Wed, 09 Feb 2000 07:36:33 -0800


Apologies for the long message. There are a lot of issues to address:

There was a clear consensus at the XML-SIG developer's day discussion
that Expat should become part of the standard distribution. Admittedly
the audience was biased and Fredrik wasn't in the room at that point but
it was clear that everyone was in agreement (in contrast to the doc-sig
discussion!). I think Andrew had some reservations (he was probably
subconsciously channeling Guido) but almost everyone in the room was
strongly behind the idea -- and the room was overfull.

Insofar as this is not a democracy, I feel the need to channel some of
the crowd's opinions and some of my own.

The crowd (and I) obviously thought that XML support is an important
part of coming with "batteries included" on the modern Web. There are
four basic specs maintained by the W3C and IETF that underlie the Web:
URLs, HTTP, HTML and now XML. In fact, modern versions of HTML (XHTML)
and HTTP (WebDAV) depend upon XML. Microsoft is also trying to establish
XML-based protocols as replacements for CORBA and as the basis of their
entire Web object model.

Luckily, the important things to know about XML are very simple. It's a
way of encoding hierarchical structures in text using a standard,
language-independent syntax that happens to be compatible with document
markup syntaxes. 
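
To make the "hierarchical structures in text" point concrete, here is a
minimal sketch that walks a small document and prints its element tree.
It assumes the expat wrapper is importable as xml.parsers.expat (the
name it has in later Python releases); outline() and the sample
document are made up for illustration.

```python
import xml.parsers.expat


def outline(xml_text):
    """Return (depth, tag) pairs for every element, in document order."""
    result = []
    depth = 0

    def start(name, attrs):
        # Record the element at the current nesting level, then descend.
        nonlocal depth
        result.append((depth, name))
        depth += 1

    def end(name):
        # Leaving an element brings us back up one level.
        nonlocal depth
        depth -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)  # True: this is the final chunk of input
    return result


# Hypothetical sample document.
doc = "<order><item>spam</item><item>eggs</item></order>"
for depth, tag in outline(doc):
    print("  " * depth + tag)
```

The nesting of the tags is the whole data model; everything else in the
spec is detail about how that nesting is written down.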

Some other things to know about it are: 

 * it is very rigorously defined 
 * there are test suites to verify implementations
 * it has enough nooks and crannies to be hard to implement
 * xmllib doesn't implement enough of it 
 * and thus isn't a conforming XML parser
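
One thing conformance buys you is that a conforming parser must reject
documents that are not well-formed. A rough sketch of such a check on
top of expat, again assuming the wrapper is available as
xml.parsers.expat; is_well_formed() is a hypothetical helper, not part
of any library:

```python
import xml.parsers.expat


def is_well_formed(xml_text):
    """Return True if expat parses xml_text without raising an error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)  # True: no more input follows
    except xml.parsers.expat.ExpatError:
        # expat stops at the first well-formedness violation.
        return False
    return True
```

For example, is_well_formed("<a><b/></a>") is true, while a mismatched
tag ("<a><b></a>"), an unquoted attribute value ("<a x=1/>") or a second
root element ("<a/><b/>") all make it return false.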

xmllib was pretty cool when it was the first XML parser in a general
purpose language. Now it is out of date. It is, however, what we present
to the world as our "XML support." Whatever we do about expat, we need
to decide what to do about the fact that xmllib is not a real XML
processor (plus it is slow as hell!). Writing an XML processor is harder
than it should be and very few people have the patience to pore over the
spec and get it right.

Okay, so of course you know where I am leading. Perl, Apache, Mozilla
and most other C-coded open source software projects embed expat. This
is because expat is blazingly fast, Unicode aware and highly conformant.
It's written in ANSI C and seems stable as a rock. It changes slowly and
doesn't have a lot of extra features. Best of all, someone else
maintains it and we have wrapped it in a pretty thin C layer which is
easy to maintain. The layer is roughly the same size as xmllib.

Guido astonished me at IPC8 with a level of humility and honesty that is
very rare in this business -- especially coming from a successful
language designer. He said that part of why Python didn't grab a bigger
part of the CGI market was because he didn't understand the importance
of CGI to the Web in the early days. He has also not been shy in saying
he doesn't know much about XML. Many of us think that it will be much
bigger than CGI.

One opinion expressed during the meeting is that XML is a big draw for
business, development money and publicity. Okay, having XML in a
separate package is not the same as ignoring it altogether but people
expect these fundamental technologies to be built in. As soon as you
split them out you run into versioning and distribution issues. Yes,
distutils will help, but I don't think it will do everything. I don't
know of any package management system that can automatically correct
version skew problems. The only "system" that works is full-distribution
testing.

Some feel that we should install PyExpat but not expat. The problem is
installation, especially on Windows. It is demonstrably the case that
Windows programmers are ALREADY nervous about installing the XML
toolkit. I got two personal emails about how to install last week (where
do people get my email address??) and the XML-SIG list got one or two
also. If we install pyexpat without expat, we'll have versioning
problems, path problems, multiple DLL problems and so forth. If we
statically bind expat and pyexpat the problems go away (on Windows at
least). There are rumours that some Unixes are not smart in the same
situation. This can be solved by renaming symbols before building, which
the C preprocessor can do.

Expat+Pyexpat is about 100K. My Python directory is 35MB so I'm not too
worried. I think that the compressed Python tarball is more than 5MB
now, isn't it?

I'm not big on the idea of multiple Python "distributions" because in
practice there will be only two: the portable one and the
Windows-specific one. We'll still have to write emails like this
imploring the (two) maintainers to support XML or whatever and we may
have divergence between the two versions.

Distributions make sense in the Linux case because there is a lot of
money going around, there is money to be made on shrink-wrapped boxes
and it is important to optimize for different cases. For Python, the
FreeBSD model of "the same everywhere" is more appropriate. If that
means a more distributed standard library maintenance mechanism, then
fine, let's work that out. I don't expect Guido to maintain PyExpat or
Expat any more than Larry Wall maintains the Perl XML parser layer or
Brian B. maintains the XML support in Apache himself.

If we can get consensus on this issue, I will approach James Clark for a
more Pythonic license. Right now expat is under the MPL but I suspect
that James will be flexible.

Therefore the concrete proposal is:

 * add expat, pyexpat and a thin SAX layer to the standard Python
distribution
 * rename symbols in expat if necessary
 * deprecate xmllib
 * continue development of the XML toolkit for non-core tools
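
To give a feel for how thin the SAX-style layer could be, here is a
rough sketch. It is not the XML-SIG code: the Handler class, its method
names and parse() are hypothetical, and the expat wrapper is assumed to
be importable as xml.parsers.expat.

```python
import xml.parsers.expat


class Handler:
    """Hypothetical event interface; subclasses override what they need."""

    def start_element(self, name, attrs):
        pass

    def end_element(self, name):
        pass

    def characters(self, data):
        pass


def parse(xml_text, handler):
    """Drive expat over xml_text and forward its events to handler."""
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = handler.end_element
    parser.CharacterDataHandler = handler.characters
    parser.Parse(xml_text, True)  # True: this is the final chunk of input


class TextCollector(Handler):
    """Example handler that gathers all character data in the document."""

    def __init__(self):
        self.chunks = []

    def characters(self, data):
        # expat may deliver text in several pieces, so collect and join.
        self.chunks.append(data)


collector = TextCollector()
parse("<p>Hello, <b>world</b>!</p>", collector)
text = "".join(collector.chunks)
```

Applications write against the handler interface only, so the parser
underneath can be swapped without touching them.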
-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
The calculus and the rich body of mathematical analysis to which it gave
rise made modern science possible, but it was the algorithm that made
the modern world possible.
	- The Advent of the Algorithm (pending), by David Berlinski