[Python-Dev] Integrating Expat

M.-A. Lemburg mal@lemburg.com
Mon, 01 Oct 2001 16:50:45 +0200


Martin von Loewis wrote:
> 
> > Are you sure that we should choose expat as "native" XML parser ?
> 
> It wouldn't necessarily be the only parser. To process XML, different
> applications have different needs. However, since the expatreader is
> the only SAX reader included in the standard library at the moment,
> guaranteeing presence of pyexpat is oft-requested. Notice that
> pyexpat.c is also in the standard library already.

Just wanted to make sure that we still have the option of including
other parsers as well :-)
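
For reference, here is a minimal sketch of what relying on the
expat-backed SAX reader looks like from the application side. It is
written in current Python 3 spelling, and the names ElementCounter and
count_elements are made up for illustration. It assumes that
xml.sax.make_parser() falls back to xml.sax.expatreader when no other
parser is requested, which only works if pyexpat is present:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the SAX reader for every start tag it encounters.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(xml_bytes):
    # With no arguments, make_parser() tries the default parser list,
    # which is the expat-based reader in the standard library.
    parser = xml.sax.make_parser()
    handler = ElementCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.counts
```

Application code written this way does not name the parser at all, so
another SAX driver could be swapped in later without touching it.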
 
> > There are other candidates which would fit this role just
> > as well (in particular, Fredrik's sgmlop looks like a nice
> > extension since it not only works with XML but also many
> > other meta languages).
> 
> Not that many candidates would work as well. For example, sgmlop has a
> number of known bugs, and a few unknown ones. Guido once complained
> that it is easy to crash sgmlop with ill-formed input, and rejected
> inclusion of sgmlop when xmlrpclib was integrated. A known problem is
> that entity references are not expanded in attributes.
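
As a point of comparison for the attribute issue, the following small
sketch shows how one could check what pyexpat hands to the
application for attribute values containing entity references. The
helper name attribute_values is hypothetical, the code uses current
Python 3 spelling, and it only exercises the predefined entities; it
says nothing about how sgmlop behaves on the same input:

```python
import xml.parsers.expat


def attribute_values(xml_text):
    """Return (element name, attribute dict) pairs as reported by expat."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        # attrs is a dict mapping attribute names to their values
        # as delivered by the parser.
        seen.append((name, dict(attrs)))

    parser.StartElementHandler = start
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return seen
```

Feeding it '<a href="x?a=1&amp;b=2"/>' should report the href value
with a plain "&" in it, i.e. with the reference already expanded.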

Well, let's put it this way: if someone finds a need to fix these
bugs, it is more likely to happen in the Python core, e.g. xmlrpclib
has already received a few tweaks (by yourself ;-) after it was
checked into the core.

I think that the sgmlop design is sufficiently simple and easy
to extend to make it a good candidate for inclusion. Sure, we'll
get bug reports, but why not add sgmlop marked as experimental
to the core in order to get it stabilized and bug-fixed ?!

I would very much like a sandbox-like area in the Python standard
dist to encourage stabilizing extensions that are proposed for
inclusion in the std lib, e.g. how about a sandbox package in the
std lib ?!
 
> Beyond that, I'm not aware of many more pure-C parsers that could
> reasonably be integrated into the core. There are many XML parsers,
> but many of them are written in C++ or Java.

Me neither... except RXP which is written in plain C.
 
> > If you want a very fast validating XML parser, RXP would also
> > be a good choice -- AFAIK, the RXP folks would allow us to
> > ship RXP under a different license than GPL which is then
> > bound to Python.
> 
> RXP would indeed be a choice. Of course, integrating it is much
> harder; you'd have to write the C module first, plus documentation,
> plus a SAX driver, plus test cases. I'm not sure how much code you can
> inherit from PyLTXML.

Sure; the question I wanted to raise was: given that we have such
an interface, would RXP also be a candidate for inclusion ?
 
> On performance: Please have a look at
> 
> http://www.xml.com/lpt/a/Benchmark/exec.html
> 
> which suggests that expat still has a speed advantage over rxp
> (assuming that the measurements were done carefully, i.e. disabling
> validation in RXP).

Hmm, I know that at least one company has been having great
success in using RXP with Python; from their experience RXP
is faster on average XML than any of the other available
(validating) parsers. May be due to their application, though, 
so YMMV.
 
> > Given the many alternatives, I am not sure whether going with
> > expat is the right path... may be wrong though.
> 
> It shouldn't be the only path. pyexpat is already integrated into the
> Python library; all I'm suggesting is to give the promise that it
> will be available on every 2.2 Python installation.
> 
> Any volunteers working on RXP integration are certainly welcome to do
> so; code contributions to PyXML will be welcome (provided the GPL
> issue gets resolved). Code contributions to the Python core would
> require some review, of course - it took quite some time to get
> pyexpat stable, and I guess any other C-integrated parser won't work
> from scratch, either.

True.

Thanks,
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/