[lxml-dev] About lxml status

Hello,

I am currently using 4Suite-XML for a project of my own, which makes use of DOM, XSLT and XPath. I have always kept an eye out for other libraries that could eventually replace 4Suite. Today I (re-)discovered libxml, and found a link to lxml from there.

What I'd like to know (before I dive deeply into the documentation) is how you would compare the XML support in 4Suite and lxml. Would you say lxml is ready for common DOM manipulations, XSLT transformations and simple XPath usage?

I ask this because I've read at http://xmlsoft.org/index.html:

"Document Object Model (DOM) http://www.w3.org/TR/DOM-Level-2-Core/ the document model, but it doesn't implement the API itself, gdome2 does this on top of libxml2"

Hope I was not offending :)

Thanks,
-- David.

Hi, David Soulayrol wrote:
lxml does not implement the DOM API either. Instead, as the cheeseshop page nicely states:

---------------------
lxml is a Pythonic binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. It extends the ElementTree API significantly to offer support for XPath, RelaxNG, XML Schema, XSLT, C14N and much more.
---------------------

Feel free to find out more from the documentation, it's full of examples:

http://codespeak.net/lxml/#documentation

Stefan

Hey, Stefan Behnel wrote: [snip]
Note that the ElementTree API is a developing Python standard, implemented by 3 separate libraries: ElementTree, cElementTree and lxml. ElementTree and cElementTree have become part of the core Python distribution as of Python 2.5. A lot of ElementTree documentation can be found here:

http://effbot.org/zone/element-index.htm

You can do common DOM-style manipulations through this API, just in a more convenient manner.

As to XPath and XSLT support, lxml has that, including the ability to create extension functions and the like. There are differences in feature set here and there, but overall lxml should be able to compete with 4Suite.

Regards,
Martijn
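To give a feel for what "DOM-style manipulations through this API" can look like, here is a minimal sketch. It uses the ElementTree module from the Python standard library rather than lxml, on the assumption that the basic calls shown are the ones the implementations share. The element names and values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Build a small tree; the DOM equivalent would be createElement/appendChild.
root = ET.Element("library")
book = ET.SubElement(root, "book", id="b1")
title = ET.SubElement(book, "title")
title.text = "XML in Practice"

# Attributes are read and written much like dictionary entries.
book.set("lang", "en")

# Children behave like list items: they can be added, indexed and removed.
extra = ET.SubElement(root, "book", id="b2")
root.remove(extra)

# Iterating over an element yields its children.
titles = [b.findtext("title") for b in root]

serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The point of the sketch is only that elements act like lists of children with a dictionary of attributes, which is what makes the API feel lighter than DOM.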

Greetings!

This discussion reminded me of a question that I've been pondering: will the XPath support in lxml eventually include support for axes and predicates?

Also, congratulations on the thorough XInclude support. Other than 4Suite, lxml is one of the few parsers that handle the "parse=" attribute and the "fallback" elements correctly.

-----Original Message-----
From: lxml-dev-bounces@codespeak.net [mailto:lxml-dev-bounces@codespeak.net] On Behalf Of Martijn Faassen
Sent: Monday, December 11, 2006 2:31 PM
To: lxml-dev@codespeak.net
Subject: Re: [lxml-dev] About lxml status

[snip]

Hello, Lee Brown wrote:
This discussion reminded me of a question that I've been pondering: will the XPATH support in LXML eventually include support for axes and predicates?
Could you explain how lxml is lacking in support for axes and predicates? Possibly you've only been looking at '.find()', which is the ElementTree-compatible but limited XPath implementation, and not at the full '.xpath()' functionality?

Regards,
Martijn
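The difference between the two entry points can be sketched as follows. The sample document and variable names are invented for illustration. The first half uses the standard-library ElementTree path syntax, and the second half uses lxml's '.xpath()' with an axis and a predicate. The lxml import is guarded so the sketch still runs where lxml is not installed.

```python
import xml.etree.ElementTree as ET

XML = """<book>
  <chapter id="c1"><title>One</title></chapter>
  <chapter id="c2"><title>Two</title></chapter>
  <chapter id="c3"><title>Three</title></chapter>
</book>"""

# ElementTree-style path: a limited subset built from simple child steps.
doc = ET.fromstring(XML)
simple_titles = [t.text for t in doc.findall("chapter/title")]

# Full XPath needs lxml; guard the import so the sketch degrades gracefully.
try:
    from lxml import etree
except ImportError:
    etree = None

later_titles = None
if etree is not None:
    root = etree.fromstring(XML)
    # A predicate ([@id='c1']) combined with an axis (following-sibling).
    later_titles = root.xpath(
        "chapter[@id='c1']/following-sibling::chapter/title/text()"
    )

print(simple_titles, later_titles)
```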

Greetings!

I apologize for my mistaken presumption. I presumed it only supported basic XPath functions because the examples on the lxml API web page only show basic examples.

I am at a significant disadvantage when it comes to lxml and the underlying libxml2/libxslt libraries as I cannot read the C source. (Back when I took formal programming courses, the three choices offered to engineering students were Basic, Fortran, and this hot, up-and-coming language called Pascal which was supposed to set the world on fire.) All of the documentation for the libxml2/libxslt libraries on xmlsoft.org is written from a C perspective and checking out the source code for lxml won't help me much, either.

So I am limited to whatever I can glean from the lxml web site examples and whatever I can discover using the usual Python code inspection techniques. (Which don't go very far when much of the functionality resides in precompiled binaries.) So really, the only way I'd be able to determine how far support for a given X-standard goes in lxml is to write a whole bunch of test cases. (This is how I figured out that lxml has broad support for the XInclude standard, even though the lxml API page states that "simple" XInclude support exists.)

Please don't infer from this that I feel negatively about lxml; I do not. I think it's absolutely great. I have tried pretty much every Python-based XML/XSLT/Xwhatever code base out there and lxml is really the only one that is robust enough, reliable enough, and FAST enough to be useful for production use. I am currently using lxml in conjunction with Mod Python on an Apache web server to serve XML content data, merging dynamic data through XIncludes and transforming the output on-the-fly into XHTML using XSLT templates. It works great!

If there's one thing I'd like to add to the lxml "wish list" it would be some more in-depth examples on the web site - there are a lot more things I'd like to be doing with lxml if I could just figure out if it will do them and how.

-----Original Message-----
From: lxml-dev-bounces@codespeak.net [mailto:lxml-dev-bounces@codespeak.net] On Behalf Of Martijn Faassen
Sent: Tuesday, December 12, 2006 1:00 PM
To: lxml-dev@codespeak.net
Subject: Re: [lxml-dev] About lxml status

[snip]
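As a rough illustration of the XSLT step in a pipeline like the one described above, the following sketch compiles a stylesheet with lxml and applies it to a parsed document. The stylesheet and the input document are invented for the example, and the XInclude step is left out. The lxml import is guarded so that the sketch simply does nothing where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    etree = None

# Hypothetical stylesheet: wraps the text of <greeting> in a <p> element.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/greeting">
    <p><xsl:value-of select="."/></p>
  </xsl:template>
</xsl:stylesheet>"""

output = None
if etree is not None:
    # Compile the stylesheet once; the result is a callable transformer.
    transform = etree.XSLT(etree.XML(STYLESHEET))
    # Apply it to a parsed source document.
    result = transform(etree.XML(b"<greeting>Hello</greeting>"))
    output = str(result)

print(output)
```

Compiling the stylesheet is the expensive part, which is why the rest of the thread is about reusing compiled stylesheets across requests.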

Dear Lee Brown,

I use lxml in the same context, but there is an issue I would like to warn you about: it is not possible to access global precompiled XSLT styles from different threads, and mod_python uses multiple threads. The only solutions I have found so far are either to prevent mod_python from using multiple threads, by globally setting "PythonInterpreter somename" in the mod_python-related Apache config while using the prefork Apache module, or to build the style anew on each request. If you see a better solution, please tell me.

Regards
Hans

On Thursday, 14.12.2006, at 08:51 -0500, Lee Brown wrote:
I am currently using lxml in conjunction with Mod Python on an Apache web server to serve XML content data, merging dynamic data through Xincludes and transforming the output on-the-fly into XHTML using XSLT templates.

Greetings!

Thanks for the warning, but I've already run headfirst into that problem. My Apache server is running the Win32MPM, where every request is a new thread, so there aren't any tricks I can play with the PythonInterpreter directive. (None that help, anyway.)

However, I did some benchmark tests and found that I can serve about 32 requests per second even with the overhead of recompiling the XSLT template anew for each request. This is adequate for my needs, though a very busy website might have trouble.

One thing I haven't tried is to pre-compile my XSLT templates, cPickle them to disk files, and then unpickle a copy to serve each request. A web server with a good file caching system might have to do very few actual disk reads, but whether it is faster to unpickle a compiled template object than to just re-compile a new one remains unknown.

If anyone has a suggestion for building a thread-safe set of precompiled templates, I'm all ears....

-----Original Message-----
From: Hans-Jürgen Hay [mailto:hjh@alterras.de]
Sent: Thursday, December 14, 2006 9:50 AM
To: Lee Brown; lxml-dev@codespeak.net
Subject: Re: [lxml-dev] About lxml status

[snip]

Greetings,

I found out very late and this gave me serious headaches. pickle does not work with XSLT objects. Even if it did, it would probably perform much worse than building from source. But thanks for the tip; maybe the developers can help out a little with coarse-grained locking at a later stage. Using lxml with mod_python should be an interesting use case.

Regards
Hans

On Thursday, 14.12.2006, at 10:46 -0500, Lee Brown wrote:

Lee Brown wrote:
I'm not clear exactly on the way threads and mod_python and all that work, but I imagine you could use a pool of templates. You'd do something like:

    try:
        tmpl = template_pool.pop()
    except IndexError:
        tmpl = compile_template()
    # then to return the template to the pool:
    template_pool.append(tmpl)

This is assuming that it's okay to move templates between threads, but not use them concurrently between threads. Or if they have to be used in the thread they were created in, you can use:

    import threading
    template_cache = threading.local()
    try:
        tmpl = template_cache.template
    except AttributeError:
        tmpl = template_cache.template = compile_template()

That's assuming that threads are long-lived, otherwise this won't change anything either.

-- 
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org
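The pool idea above can be fleshed out into something runnable. This is only a sketch: compile_template here is a hypothetical stand-in that returns a dummy object instead of building an XSLT template, handle_request is an invented name, and queue.LifoQueue is swapped in for the plain list to make the borrow/return steps explicitly thread-safe. Whether lxml templates may really be moved between threads remains the assumption stated in the message, not something this code establishes.

```python
import queue
import threading

compile_count = 0
count_lock = threading.Lock()

def compile_template():
    """Hypothetical stand-in for an expensive XSLT compilation."""
    global compile_count
    with count_lock:
        compile_count += 1
    return object()

# Thread-safe stack of idle templates.
template_pool = queue.LifoQueue()

def handle_request(request_id, results):
    # Borrow an idle template, or compile a new one if the pool is empty.
    try:
        tmpl = template_pool.get_nowait()
    except queue.Empty:
        tmpl = compile_template()
    try:
        # Real code would apply tmpl to a parsed document here.
        results.append("request %d done" % request_id)
    finally:
        # Always hand the template back, even if rendering failed.
        template_pool.put(tmpl)

results = []
threads = [
    threading.Thread(target=handle_request, args=(i, results))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(compile_count, template_pool.qsize())
```

The number of compilations depends on how many requests overlap, so it varies from run to run, but every compiled template ends up back in the pool and no template is ever used by two threads at once.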

Greetings!

The Apache web server has several different MPMs (Multi-Processing Modules) available to it (unless you're running the Win32MPM, in which case that's the one you're stuck with). But basically, the web server can spawn either processes or threads to handle incoming requests.

In the Win32MPM, each VHOST (virtual web site) runs as a separate OS process and each request that a VHOST receives is handled entirely as a thread within that process. Each thread invokes a chain of request handlers (code modules that handle specific tasks like authentication, authorization, content delivery, output filtering, and so forth) that are instantiated for that thread and then die at the end of the request. Request threads may arrive simultaneously and are by nature very short-lived. If a VHOST gets 32 simultaneous requests, 32 threads get created and then within a second or two all 32 threads are finished and terminated. (By default the Win32 MPM can have a maximum of 250 concurrent threads.)

What Mod Python does is to allow you to specify a Python function that will handle a specific task or tasks in the chain in lieu of Apache's standard handlers. Mod Python's default behavior is to create a Python interpreter for each VHOST and this interpreter is responsible for executing the various handler functions in a thread-safe way for each request. (I have NO idea how it does it, nor is my state of confusion likely to change even if someone explains it to me.) The source code containing the function is imported as a module at interpreter startup in the normal 'Python' way, that is, executable code in the module defined outside of the handler function definition(s) is executed on import and is global to the handler function(s).

So, naively, I wrote some global code to pre-load and pre-compile all of my XSLT templates into a dictionary at startup. Then, within the handler function definition I look up the correct template in the dictionary and use it to transform the parsed XML source object. This worked just fine as long as one and only one thread was being executed at any given time. Simultaneous requests would either bomb out with a threading-related error or just hang until the server ran out of available threads and crashed.

Apparently, Mod Python can dole out handler functions in a thread-safe way, but any global objects you create at import time are not so lucky. Nor does there seem to be any way to share an object from one thread with another thread.

One way around this may be to pass a copy of the template dictionary to the handler function, that is, pass a literal copy instead of an object reference. This would eliminate the time overhead of recompiling templates for each request at the expense of possibly having a lot of copies in memory at one time. But since my server always seems to have plenty of free memory, I'll give it a try.

-----Original Message-----
From: Ian Bicking [mailto:ianb@colorstudy.com]
Sent: Thursday, December 14, 2006 4:50 PM
To: Lee Brown
Cc: 'Hans-Jürgen Hay'; lxml-dev@codespeak.net
Subject: Re: [lxml-dev] About lxml status

[snip]
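For comparison, the "dictionary built at import time" layout can be combined with the coarse-grained locking mentioned earlier in the thread. The sketch below is an illustration only: compile_template, TEMPLATES, TEMPLATES_LOCK and handler are invented names, the templates are plain Python callables rather than XSLT objects, and nothing in this thread confirms that a single application-level lock is enough to make lxml's compiled stylesheets safe to share.

```python
import threading

def compile_template(name):
    """Hypothetical stand-in; real code would build an XSLT object from a file."""
    return lambda doc: "%s(%s)" % (name, doc)

# Built once at import time, as in the message above.
TEMPLATES = {name: compile_template(name) for name in ("page", "feed")}

# One coarse lock shared by every template in the dictionary.
TEMPLATES_LOCK = threading.Lock()

def handler(template_name, doc):
    # Only one thread at a time may use any of the shared templates.
    with TEMPLATES_LOCK:
        return TEMPLATES[template_name](doc)

results = []

def worker(i):
    results.append(handler("page", "doc%d" % i))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

The trade-off is that transformations are serialized, so this gives up concurrency in exchange for compiling each template only once.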

David Soulayrol wrote:
lxml does not provide a DOM API implementation but an ElementTree API, which is similar to DOM but simpler to use (more "pythonic"). As for XSLT and XPath, lxml supports them out of the box.

If you really need a DOM API, then you should probably look at this project:

http://www.python.org/pypi/libxml2dom

-- 
Olivier

participants (7)
- David Soulayrol
- Hans-Jürgen Hay
- Ian Bicking
- Lee Brown
- Martijn Faassen
- Olivier Grisel
- Stefan Behnel