Martin von Loewis wrote:
Are you sure that we should choose expat as "native" XML parser ?
It wouldn't necessarily be the only parser. To process XML, different applications have different needs. However, since the expatreader is the only SAX reader included in the standard library at the moment, guaranteeing presence of pyexpat is oft-requested. Notice that pyexpat.c is also in the standard library already.
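For illustration, here is a minimal sketch of what relying on the bundled SAX reader looks like. It assumes a current Python 3 interpreter where xml.sax falls back to the expat reader by default; the ElementCounter class, the count_elements helper and the sample document are made up for this example and are not part of any library.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Hypothetical handler: count how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the SAX reader for every opening tag.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(xml_bytes):
    """Parse xml_bytes with the default SAX reader and return the counts."""
    handler = ElementCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts


# Made-up sample document, for illustration only.
sample = b"<doc><item/><item/><note>hi</note></doc>"
counts = count_elements(sample)
print(counts)
```

If pyexpat were missing, the parseString call would fail because no SAX reader could be created, which is the situation the guarantee is meant to rule out.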
Just wanted to make sure that we still have the option of including other parsers as well :-)
There are other candidates which would fit this role just as well (in particular, Fredrik's sgmlop looks like a nice extension since it not only works with XML but also many other meta languages).
Not that many candidates would work as well. For example, sgmlop has a number of known bugs, and a few unknown ones. Guido once complained that it is easy to crash sgmlop with ill-formed input, and rejected inclusion of sgmlop when xmlrpclib was integrated. A known problem is that entity references are not expanded in attributes.
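To make the two points concrete, here is a small sketch of how pyexpat behaves in the same situations, assuming a current Python 3 with xml.parsers.expat available. It says nothing about sgmlop itself; the parse_attributes and is_well_formed helpers and the sample documents are invented for this example.

```python
import xml.parsers.expat


def parse_attributes(xml_text):
    """Hypothetical helper: return (element name, attribute dict) pairs."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # The handler receives the attributes as a dict by default.
    parser.StartElementHandler = lambda name, attrs: seen.append((name, dict(attrs)))
    parser.Parse(xml_text, True)  # True marks the end of the input
    return seen


def is_well_formed(xml_text):
    """Hypothetical helper: ill-formed input raises ExpatError."""
    try:
        parse_attributes(xml_text)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# An internal entity and a predefined entity, both used inside an attribute.
doc = '<!DOCTYPE a [<!ENTITY who "world">]><a title="hello &who; &amp; co"/>'
result = parse_attributes(doc)
print(result)

# Mismatched tags are reported as an exception rather than crashing.
print(is_well_formed('<a title="x"><b></a>'))
```

Both entity references in the title attribute come back expanded, and the ill-formed document is rejected with an ordinary Python exception.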
Well, let's put it this way: if someone finds a need to fix these bugs, it is more likely to happen in the Python core, e.g. xmlrpclib has already received a few tweaks (by yourself ;-) after it was checked into the core. I think that the sgmlop design is sufficiently simple and easy to extend to make it a good candidate for inclusion. Sure, we'll get bug reports, but why not add sgmlop marked as experimental to the core in order to get it stabilized and bug-fixed ?! I would very much like a sandbox-like part in the Python standard dist to encourage stabilization of proposed-to-be-included std lib extensions, e.g. how about a sandbox package in the std lib ?!
Beyond that, I'm not aware of many more pure-C parsers that could reasonably be integrated into the core. There are many XML parsers, but many of them are written in C++ or Java.
Me neither... except RXP which is written in plain C.
If you want a very fast validating XML parser, RXP would also be a good choice -- AFAIK, the RXP folks would allow us to ship RXP under a different license than GPL which is then bound to Python.
RXP would indeed be a choice. Of course, integrating it is much harder; you'd have to write the C module first, plus documentation, plus a SAX driver, plus test cases. I'm not sure how much code you can inherit from PyLTXML.
Sure; the question I wanted to raise was: given that we have such an interface, would RXP also be a candidate for inclusion ?
On performance: Please have a look at
http://www.xml.com/lpt/a/Benchmark/exec.html
which suggests that expat still has a speed advantage over RXP (assuming that the measurements were done carefully, i.e. disabling validation in RXP).
Hmm, I know that at least one company has been having great success in using RXP with Python; from their experience RXP is faster on average XML than any of the other available (validating) parsers. May be due to their application, though, so YMMV.
Given the many alternatives, I am not sure whether going with expat is the right path... may be wrong though.
It shouldn't be the only path. pyexpat is already integrated into the Python library; all I'm suggesting is to give the promise that it will be available on every Python 2.2 installation.
Any volunteers working on RXP integration are certainly welcome to do so; code contributions to PyXML will be welcome (provided the GPL issue gets resolved). Code contributions to the Python core would require some review, of course - it took quite some time to get pyexpat stable, and I guess any other C-integrated parser won't work from scratch, either.
True.

Thanks,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/