
[I know I've asked this before, but Fred wanted me to ask it again :-]

What do you think about integrating Expat into Python, so that pyexpat can always be built (and always with the same version of Expat)? Which version of Expat would you use? Would you put the Expat files into a separate directory, or all into Modules?

Here is my proposal: Integrate Expat 1.95.2 for release together with Python 2.2, in an expat subdirectory of Modules (taking only the lib files of Expat). This would affect build procedures on all targets; in particular, pyexpat would not link to a shared Expat DLL, but would incorporate the object files.

Regards,
Martin
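The "same version" concern can be checked on any given installation. The following is a minimal sketch, assuming a current Python 3 interpreter (not the Python 2.2 under discussion), where `xml.parsers.expat` re-exports the `EXPAT_VERSION` string and the `version_info` tuple from pyexpat. The helper name `expat_version` is made up for illustration.

```python
from xml.parsers import expat


def expat_version():
    """Return the (major, minor, micro) tuple of the Expat that pyexpat was built with."""
    return expat.version_info


# EXPAT_VERSION is a human-readable string; version_info is a tuple of three integers.
print("pyexpat reports:", expat.EXPAT_VERSION)
print("as a tuple:", expat_version())
```

Two installations that print different tuples are exactly the situation the proposal tries to rule out.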

Martin von Loewis wrote:
Are you sure that we should choose expat as "native" XML parser ? There are other candidates which would fit this role just as well (in particular, Fredrik's sgmlop looks like a nice extension since it not only works with XML but also many other meta languages).

If you want a very fast validating XML parser, RXP would also be a good choice -- AFAIK, the RXP folks would allow us to ship RXP under a different license than GPL which is then bound to Python.

Given the many alternatives, I am not sure whether going with expat is the right path... may be wrong though.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/

> Are you sure that we should choose expat as "native" XML parser ?
It wouldn't necessarily be the only parser. To process XML, different applications have different needs. However, since the expatreader is the only SAX reader included in the standard library at the moment, guaranteeing presence of pyexpat is oft-requested. Notice that pyexpat.c is also in the standard library already.
Not that many candidates would work as well. For example, sgmlop has a number of known bugs, and a few unknown ones. Guido once complained that it is easy to crash sgmlop with ill-formed input, and rejected inclusion of sgmlop when xmlrpclib was integrated. A known problem is that entity references are not expanded in attributes. Beyond that, I'm not aware of many more pure-C parsers that could reasonably be integrated into the core. There are many XML parsers, but many of them are written in C++ or Java.

RXP would indeed be a choice. Of course, integrating it is much harder; you'd have to write the C module first, plus documentation, plus a SAX driver, plus test cases. I'm not sure how much code you can inherit from PyLTXML. On performance: Please have a look at http://www.xml.com/lpt/a/Benchmark/exec.html which suggests that expat still has a speed advantage over rxp (assuming that the measurements were done carefully, i.e. disabling validation in RXP).

> Given the many alternatives, I am not sure whether going with expat is the right path... may be wrong though.

It shouldn't be the only path. pyexpat is already integrated into the Python library; all I'm suggesting is to give the promise that it will be available on every 2.2 Python installation. Any volunteers working on RXP integration are certainly welcome to do so; code contributions to PyXML will be welcome (provided the GPL issue gets resolved). Code contributions to the Python core would require some review, of course - it took quite some time to get pyexpat stable, and I guess any other C-integrated parser won't work from scratch, either.

Regards,
Martin
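To make the two technical points in this message concrete (expatreader as the SAX reader in the standard library, and entity references inside attribute values), here is a small sketch. It assumes a current Python 3, where `xml.sax.parseString` uses the Expat-based reader by default; the class `AttrCollector`, the function `collect_attributes`, and the sample document are invented for illustration and do not come from the thread.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AttrCollector(ContentHandler):
    """Record (element, attribute, value) triples as the SAX reader reports them."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def startElement(self, name, attrs):
        for key in attrs.getNames():
            self.attrs.append((name, key, attrs.getValue(key)))


def collect_attributes(document):
    """Parse a bytes document with the default SAX reader and return its attributes."""
    handler = AttrCollector()
    xml.sax.parseString(document, handler)
    return handler.attrs


# The attribute value contains an entity reference that a conforming parser must expand.
SAMPLE = b'<doc title="Tom &amp; Jerry"><item id="1"/></doc>'
print(collect_attributes(SAMPLE))
```

With the Expat-based reader the `title` value arrives as `Tom & Jerry`, i.e. with the reference expanded, which is the behaviour the message says sgmlop lacked at the time.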

Martin von Loewis wrote:
Just wanted to make sure that we still have the option of including other parsers as well :-)
Well, let's put it this way: if someone finds a need to fix these bugs, it is more likely to happen in the Python core, e.g. xmlrpclib has already received a few tweaks (by yourself ;-) after it was checked into the core.

I think that the sgmlop design is sufficiently simple and easy to extend to make it a good candidate for inclusion. Sure, we'll get bug reports, but why not add sgmlop marked as experimental to the core in order to get it stabilized and bug-fixed ?!

I would very much like a sandbox-like part in the Python standard dist to encourage stabilizing of proposed-to-be-included std lib extensions, e.g. how about a sandbox package in the std lib ?!
Me neither... except RXP which is written in plain C.
Sure; the question I wanted to raise was: given that we have such an interface, would RXP also be a candidate for inclusion ?
Hmm, I know that at least one company has been having great success in using RXP with Python; from their experience RXP is faster on average XML than any of the other available (validating) parsers. May be due to their application, though, so YMMV.
True.

Thanks,
--
Marc-Andre Lemburg

    mal> I think that the sgmlop design is sufficiently simple and easy to
    mal> extend to make it a good candidate for inclusion. Sure, we'll get
    mal> bug reports, but why not add sgmlop marked as experimental to the
    mal> core in order to get it stabilized and bug-fixed ?!

I would be happy to see sgmlop added to the core. The xmlrpclib encoding and decoding do need some sort of C-based acceleration to be usable:

    % python testxmlrpc.py
    testing with xmlrpclib 0.9.8
    using FastParser
    415 dumps per second
    106 loads per second
    disabling fast parsers in xmlrpclib
    using SlowParser
    412 dumps per second
    16.1 loads per second

FWIW, the xmlrpclib delivered with Python is substantially slower dumping data than the 0.9.x versions that have been around awhile, though its decoding performance degrades less without sgmlop. Compare the above with this:

    % PYTHONPATH=~/misc/python/python2 python testxmlrpc.py
    testing with xmlrpclib 1.0b3
    using SgmlopParser
    229 dumps per second
    94.3 loads per second
    disabling fast parsers in xmlrpclib
    using ExpatParser
    231 dumps per second
    76.8 loads per second

I haven't had or taken the time to investigate the difference yet.

Skip
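The testxmlrpc.py script itself is not shown in the thread. As a rough idea of how "dumps per second" and "loads per second" figures of this kind can be produced, here is a sketch that assumes a current Python 3, where the module is named `xmlrpc.client` rather than xmlrpclib. The helper `rate`, the payload, and the timing window are all invented for illustration, and the numbers it prints are not comparable to the ones quoted above.

```python
import time
import xmlrpc.client


def rate(func, arg, seconds=0.2):
    """Call func(arg) repeatedly for roughly `seconds` and return calls per second."""
    count = 0
    start = time.perf_counter()
    while True:
        func(arg)
        count += 1
        elapsed = time.perf_counter() - start
        if elapsed >= seconds:
            return count / elapsed


def dump(params):
    """Encode a parameter tuple as an XML-RPC request for a hypothetical method 'test'."""
    return xmlrpc.client.dumps(params, methodname="test")


# Hypothetical payload; dumps() expects the parameters as a tuple.
payload = ({"name": "example", "values": list(range(50)), "ratio": 1.5},)
encoded = dump(payload)

print("%.1f dumps per second" % rate(dump, payload))
print("%.1f loads per second" % rate(xmlrpc.client.loads, encoded))
```

A longer timing window and a payload that resembles real traffic would give steadier figures; the short window here only keeps the sketch quick to run.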

Skip Montanaro wrote:
Hmm, you cannot really compare these numbers though, since the two runs use two different sets of parsers. Have you checked using SgmlopParser with the 0.9.8 version of xmlrpclib ?

There's also a project on SF called py-xmlrpc which uses a C implementation as basis and is said to be much faster than xmlrpclib (at least that's what they quote on their web-page).

--
Marc-Andre Lemburg

    mal> Hmm, you cannot really compare these numbers though, since the two
    mal> runs use two different sets of parsers. Have you checked using
    mal> SgmlopParser with the 0.9.8 version of xmlrpclib ?

They are the same parser. I forgot to mention that. What is called "FastParser" in 0.9.8 is called SgmlopParser in the CVS version. That has a different thing called "FastParser". I believe it is the thing you can get by contacting Pythonware.

    mal> There's also a project on SF called py-xmlrpc which uses a C
    mal> implementation as basis and is said to be much faster than
    mal> xmlrpclib (at least that's what they quote on their web-page).

Yes, it is. Amazingly enough, the guy who wrote it (Shilad Sen at Sourcelight Technologies) works in the same building I do (and it's a pretty small building). We had lunch last week and talked a bit about it. It doesn't yet do Unicode. I sent Shilad my little test script. He modified it to use his parser. His results suggest that py-xmlrpc is about as fast as cPickle.

Skip

Skip Montanaro wrote:
I see. Rereading your numbers suggests that only dumps got slower. Now that you've fixed this in CVS the reason is obvious... from xyz import abc is slow. OTOH, Fredrik mentions that he put in this change in order to decrease startup time for the lib. I guess you can't win 'em all :-)
Cool !

--
Marc-Andre Lemburg

Martin von Loewis wrote:
How would libxml fit into this picture ?

http://xmlsoft.org/

libxml is written in C as well and under the LGPL. There's also Apache's Xerces which is written in a portable subset of C++ (it is probably too big though to be integrated into Python):

http://xml.apache.org/xerces-c/

--
Marc-Andre Lemburg

Martin von Loewis writes:
> [I know I've asked this before, but Fred wanted me to ask it again :-]
Actually, I think I simply suggested the forum so that others could comment as well. ;-)
I have mixed feelings. There are really two things that we could do: We could add Expat to our CVS repository, which means syncing a bunch of files every time a new Expat release comes out, or we could bundle the Expat sources with the Python source distribution when the distribution is built, but not add them to CVS. This avoids the extra files in CVS, but complicates construction of the distribution and adds a new wrinkle to the configuration management.

For the "Parsed XML" Zope product, we included the sources for the Expat library in our CVS, but added our own configure.in and other build-control files, which are simpler than those included with Expat (since it only needs to build the static library). This seems to work reasonably well, and does not introduce new wrinkles to the configuration management. So I think we agree on the approach to take.

M.-A. Lemburg writes:
See Martin's comments about this. I think this precludes inclusion of sgmlop until the problems it has have been addressed in the implementation. I'm not sure what "meta languages" it handles; I thought it only dealt with XML/XHTML and HTML document markup.
Agreed. I think it would be really nice to have an interface for RXP that was easy to build and use. I haven't looked at PyLTXML in a long time, so I'm not sure what state it's in.
> Given the many alternatives, I am not sure whether going with expat is the right path... may be wrong though.

As Martin said, RXP and Expat together don't really qualify as "many". sgmlop just isn't robust enough (yet), and it's not clear there are other alternatives. There is libxml (a.k.a. gnome-xml), which is licensed under the LGPL; Python bindings for that are described as being in the alpha stage, but I haven't had time to play with them myself.

-Fred

--
Fred L. Drake, Jr. <fdrake at acm.org>
PythonLabs at Zope Corporation

I thought MvL had already volunteered to do this?
cannot fix bugs if nobody bothers to report them ;-) (the crash issue appears to be a rumour; there was a bug when running in SGML mode, but that was fixed long ago. people using the current release in real-life applications haven't reported any stability problems...) on the other hand, sgmlop itself will never be anything but a "fast but sloppy" XML tokenizer. if you risk running into xml compliance nazis <0.1 wink>, you shouldn't use it. </F>

Fredrik Lundh writes:
> I thought MvL had already volunteered to do this?
I didn't state this was a huge issue or that it didn't have a nice solution. ;-) It also isn't something that happens all that often, given that I don't have a lot of time to make Expat releases.
Glad to hear this! Perhaps someone (not implying you) should start writing a substantial test suite for it to ferret out any remaining bugs? I don't see a test_sgmlop.py in the PyXML package; if you already have something perhaps you could contribute it? It might help you unload maintenance if anyone does manage to find a bug.
"Nazi" would not have been my word for it, but ... Wham! ;-)

-Fred

On Sun, Sep 30, 2001 at 04:53:06PM +0200, Martin von Loewis wrote:
Speaking from the experience of bundling Expat directly into the Apache binaries (also using a subset of the original source) ...

I think bundling the sources is fine, but it should *ONLY* be a fallback if you do not find the Expat library installed on the system. *ALWAYS* link against a system-installed library first.

We ran into a problem that has bothered some Perl users for a long while now. Specifically: Apache 1.3 would get loaded and export the Expat symbols to the rest of the process. Any third-party module that was built *against Apache* (obviously the case since they are Apache modules) and needed Expat would immediately resolve upon loading and be happy.

But! What we ran into is mod_perl (linked against Apache) running a Perl script which, in turn, loaded XML::Parser::Expat. That Perl module linked against *Expat*, not Apache (it is a standard module and has nothing to do with Apache). Well... when the Perl module was loaded, you now had *two* sets of Expat symbols in the process space. Segfaults, bugs, and madness ensued.

I just made some fixes this past week to Apache 1.3 to fix the situation somewhat. The basic answer is to always grab a system (.so) library when possible. When the shared lib is present, then both Apache and XML::Parser::Expat would link against the same thing upon loading. And Apache still has the feature of exposing XML to its third-party modules.

This situation could easily happen to Python, too. Imagine building Expat directly into pyexpat. Some Python script loads pyexpat and the Expat symbols come with it. Now, some *other* module is loaded and dynamically links against /usr/lib/libexpat.so. Now you have *two* sets of Expat symbols and crashes are going to start happening.

-1 on *always* using bundled sources -- they should only be a fallback.
+1 on including it as a fallback.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
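The "system library first, bundled sources only as a fallback" rule can be sketched as a small decision function. This is only an illustration of the policy, not Python's actual build machinery: the function `expat_build_plan`, the dictionary layout, and the bundled file paths are hypothetical, and `ctypes.util.find_library` is used here merely as a convenient way to ask whether a shared Expat is installed.

```python
from ctypes.util import find_library

# Hypothetical location of bundled Expat sources inside the source tree.
BUNDLED = ["expat/xmlparse.c", "expat/xmlrole.c", "expat/xmltok.c"]


def expat_build_plan(system_library, bundled_sources):
    """Prefer linking against a system Expat; compile bundled sources only as a fallback."""
    if system_library:
        # A shared library was found: link against it so that every module in
        # the process resolves to one set of Expat symbols.
        return {"sources": ["pyexpat.c"], "libraries": ["expat"]}
    # No system library: build the bundled copy into the extension itself.
    return {"sources": ["pyexpat.c"] + list(bundled_sources), "libraries": []}


# find_library returns a library name if one is found, otherwise None.
print(expat_build_plan(find_library("expat"), BUNDLED))
```

The first branch is the case that avoids the duplicate-symbol crashes described above; the second is the "always buildable" guarantee from the original proposal.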

participants (6)
- Fred L. Drake, Jr.
- Fredrik Lundh
- Greg Stein
- M.-A. Lemburg
- Martin von Loewis
- Skip Montanaro