[lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode

Hello everyone,

I'm developing a mod_python application that is based on XML\XSLT transforming.

I used 4Suite libraries for that, but as the speed was unacceptable for me, I switched to lxml. Everything became much easier and 10 times faster, but I've encountered the subject problem.

In brief - all my data and xslt are stored and transferred in UTF-8. With 4Suite everything was fine all the time. With lxml it works fine from the console, but inside mod_python it occasionally dies, ~ one time out of three. Strange - the same code with the same data works or dies by its own means.

As far as I have found, there was a similar problem with PyXML and the encodings module, so this is a problem with UTF, but there was no clear solution.

So, my configuration is the following:
Python 2.5.1
Server version: Apache/2.2.4 (FreeBSD)
mod_python-3.3.1

And the relevant parts of my code are these:

    def extApplyXslt(xslt, data, logger ):
        try:
            strXslt = urllib2.urlopen(xslt).read()  # i have to read the xslt url to the python string
        except urllib2.HTTPError, e:
            .......
        except urllib2.URLError, e:
            .............
        try:
            xslt_parser = etree.XMLParser()
            xslt_parser.resolvers.add( PrefixResolver("XSLT") )
            # and now I have to use the string; a more elegant solution, anyone?
            f = StringIO(strXslt)
            xslt_doc = etree.parse(f, xslt_parser)
            # and here where the problem comes
            transform = etree.XSLT(xslt_doc)
        except Exception, exc:
            logger.log(logging.CRITICAL, exc.__str__() )
        try:
            result_tree = transform(data)
            return etree.tostring(result_tree, 'utf-8')
        except Exception, exc:
            print "xslt processing error!", exc.__str__()
            return ""

It dies with the message 'cannot unmarshal code objects in restricted execution mode'. By profiling I detected the point where the problem occurs:

    transform = etree.XSLT(xslt_doc)

So, I would be grateful for any suggestions on how to get rid of this. I'd really like to use lxml. Maybe I should initialize the xslt processor in some other way?

Thanks in advance,
Dmitri

Greetings!

The first thing I'd suggest is to also post your query on the Mod Python list.

A few questions:

Are you trying to execute this code in a Handler or in a Filter? There's a world of hidden trouble lurking in Filters because of their re-entrant nature.

Which Apache MPM are you using? If you're using a multiple-process module, you might try switching to a single-process-multiple-thread module to see if this behavior changes.

-----Original Message-----
From: lxml-dev-bounces@codespeak.net [mailto:lxml-dev-bounces@codespeak.net] On Behalf Of Dmitri Fedoruk
Sent: Thursday, September 13, 2007 11:18 AM
To: lxml-dev@codespeak.net
Subject: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode
Thanks in advance,
Dmitri
_______________________________________________
lxml-dev mailing list
lxml-dev@codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev

Dmitri Fedoruk wrote:
> I'm developing a mod_python application that is based on XML\XSLT transforming.
> I used 4Suite libraries for that, but as the speed was unacceptable for me, I switched to lxml. Everything became much easier and 10 times faster

Thanks for sharing that. :)

> but I've encountered the subject problem.
> In brief - all my data and xslt are stored and transferred in UTF-8. With 4Suite everything was fine all the time. With lxml it works fine from the console, but inside mod_python it occasionally dies, ~ one time out of three. Strange - the same code with the same data works or dies by its own means.
> As far as I have found, there was a similar problem with PyXML and encodings module, this is the problem with UTF, but there was no clear solution.
> So, my configuration is the following:
> Python 2.5.1
> Server version: Apache/2.2.4 (FreeBSD)
> mod_python-3.3.1

Looks like you forgot to mention the lxml version you are using.

> And the relevant parts of my code are these:
> def extApplyXslt(xslt, data, logger ):
>     try:
>         strXslt = urllib2.urlopen(xslt).read()  # i have to read the xslt url to the python string
>     except urllib2.HTTPError, e:
>         .......
>     except urllib2.URLError, e:
>         .............
>     try:
>         xslt_parser = etree.XMLParser()
>         xslt_parser.resolvers.add( PrefixResolver("XSLT") )
>         # and now I have to use the string; a more elegant solution,

As I already mentioned on c.l.py, you can pass the result of urlopen() directly into parse().

>         f = StringIO(strXslt)
>         xslt_doc = etree.parse(f, xslt_parser)
>         # and here where the problem comes
>         transform = etree.XSLT(xslt_doc)
>     except Exception, exc:
>         logger.log(logging.CRITICAL, exc.__str__() )
>     try:
>         result_tree = transform(data)
>         return etree.tostring(result_tree, 'utf-8')
>     except Exception, exc:
>         print "xslt processing error!", exc.__str__()
>         return ""
> It dies with the message 'cannot unmarshal code objects in restricted execution mode'. By profiling I detected the point where problem occurs: transform = etree.XSLT(xslt_doc)

Hmmm, I can't see where any "unmarshaling" should be taking place here - definitely not in XSLT(). And I don't get why this should only happen once in a while. Can you figure out what is writing this message? The python interpreter or mod_python?

Stefan
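
Stefan's suggestion to hand the urlopen() result straight to parse() could look roughly like the sketch below. It relies only on parse() accepting any object with a read() method. The names XSLT_BYTES and load_stylesheet_doc are made up for illustration, io.BytesIO stands in for the network response so the sketch runs offline, and the fallback import exists only so it also runs where lxml is not installed. Note that the thread uses Python 2 (urllib2); in Python 3 the equivalent call is urllib.request.urlopen().

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback so the sketch runs without lxml; parse() behaves the same here.
    import xml.etree.ElementTree as etree

# Hypothetical stylesheet content, standing in for what the URL would return.
XSLT_BYTES = b"""<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/"><out/></xsl:template>
</xsl:stylesheet>
"""


def load_stylesheet_doc(stream):
    """Parse a stylesheet from any file-like object, without an intermediate string."""
    return etree.parse(stream)


# In the real application the stream would be the object returned by urlopen(xslt).
doc = load_stylesheet_doc(io.BytesIO(XSLT_BYTES))
root_tag = doc.getroot().tag
print(root_tag)
```

Passing the stream directly also lets the parser read the bytes and honour the encoding declared in the XML prolog, instead of going through a Python string and StringIO first.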

> Everything became much easier and 10 times faster, but I've encountered the subject problem.

Same problem here, but with different code and versions:
* Django as webframework
* Apache 2.0.59 and 2.2.4
* lxml 1.3.x (all versions)
* mod_python 3.2.10 and 3.3.1
* libxml2 2.6.28 / libxslt 1.1.20

I think this might have something to do with mod_python fiddling with __builtins__; at least googling for the error message told me that Python switches to restricted mode when doing so (but this might be one trigger of many). lxml seems to run callbacks in its own "sandbox" (or something like this, at least it seems to be a different environment from the one the outer code had), which works fine unless the restricted mode is triggered.

Somehow restricted mode is only mentioned in the docs for RExec (http://docs.python.org/lib/module-rexec.html), but should not be available any more, so I don't know what exactly lxml does to use callbacks.

Some further bug-hunting I did revealed that the "unmarshaling" error only occurred if all modules I used in the callback were loaded before the callback runs. If I load them inside the callback, the error differs. Example:

------------8<----------------------------------------------------
# unmarshaling error
from foo import bar
def callback(ctx, ...):
    return bar()
---------------------------------------------------->8------------

------------8<----------------------------------------------------
# other error
def callback(ctx, ...):
    from foo import bar
    return bar()
---------------------------------------------------->8------------

As I don't have the needed mod_python configuration set up here, I can't tell what the other error is, but I will add this later. (I think it was some ImportError.)

I did not report this problem, because I was not sure which part in the chain that produces the webpages was responsible. Django does fiddle with __builtins__, too (but removing that didn't help). And perhaps this is simply a mod_python bug. So I used FastCGI, which works well. But I'm very interested in a better solution. ;-)

For the questions raised by Lee Brown:

> Are you trying to execute this code in a Handler or in a Filter? There's a world of hidden trouble lurking in Filters because of their re-entrant nature.

I use normal XSLT callbacks. I tried different methods to tell lxml which callbacks I have; none worked (global namespace, callbacks as "extensions" parameter for etree.XSLT).

XSLT sample snippet:
<xsl:value-of select="py:highlightInline(string(.))" disable-output-escaping="yes"/>
(Namespace is defined, callback gets called and works fine...until I try to use the code with mod_python)

> Which Apache MPM are you using? If you're using a multiple-process module, you might try switching to a single-process-multiple-thread module to see if this behavior changes.

Using prefork here, as all threaded modules have problems with mod_php. mod_php might be another source of errors. I read something about failing DB connections when using mod_php and mod_python. But I don't really think disabling mod_php will make a difference here.

Greetings, David Danier
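
For readers unfamiliar with the "extensions" parameter David mentions, here is a minimal sketch of registering a Python callback as an XSLT extension function in lxml. It is not David's code: the namespace URI urn:example:py, the function body, and the helper run() are assumptions made up for illustration, and the import guard exists only so the snippet does not fail where lxml is missing.

```python
try:
    from lxml import etree
except ImportError:
    etree = None  # the sketch is skipped when lxml is not installed

# Hypothetical namespace URI; the thread does not say which one was used.
NS = "urn:example:py"


def highlight_inline(context, text):
    """Stand-in for the real callback; lxml passes the XPath context first."""
    return text.upper()


STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:py="urn:example:py">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="py:highlightInline(string(.))"/>
  </xsl:template>
</xsl:stylesheet>"""


def run():
    """Apply the stylesheet to a tiny document; returns None without lxml."""
    if etree is None:
        return None
    transform = etree.XSLT(
        etree.XML(STYLESHEET),
        # Keys are (namespace URI, function name) pairs.
        extensions={(NS, "highlightInline"): highlight_inline},
    )
    result = transform(etree.XML(b"<doc>hello</doc>"))
    return str(result).strip()


print(run())
```

The callback runs while libxslt is executing the transformation, which is exactly the point where lxml has to re-enter the Python interpreter and where the interpreter mismatch discussed later in the thread can surface.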

Greetings! Sorry, I should have stated my first question more clearly. Are you calling your routines from within a Mod Python requestHandler object or an outputFilter object?
-----Original Message-----
From: lxml-dev-bounces@codespeak.net [mailto:lxml-dev-bounces@codespeak.net] On Behalf Of David Danier
Sent: Thursday, September 13, 2007 12:02 PM
To: lxml-dev@codespeak.net
Subject: Re: [lxml-dev] lxml + mod_python: cannot unmarshal code objects in restricted execution mode

> Sorry, I should have stated my first question more clearly. Are you calling your routines from within a Mod Python requestHandler object or an outputFilter object?

It is called out of a RequestHandler, but I'm not really doing this myself. Django does most of the work, see:
http://www.djangoproject.com/documentation/modpython/
http://code.djangoproject.com/browser/django/trunk/django/core/handlers/modp...

Greetings, David Danier

> Somehow restricted mode is only mentioned in the docs for RExec (http://docs.python.org/lib/module-rexec.html), but should not be available any more, so I don't know what exactly lxml does to use callbacks.

By accident I found another place that mentions restricted mode:
http://www.modpython.org/live/current/doc-html/pyapi-interps.html

I think this paragraph describes the problem pretty well:

------------8<----------------------------------------------------
Note that if any third party module is being used which has a C code component that uses the simplified API for access to the Global Interpreter Lock (GIL) for Python extension modules, then the interpreter name must be forcibly set to be "main_interpreter". This is necessary as such a module will only work correctly if run within the context of the first Python interpreter created by the process. If not forced to run under the "main_interpreter", a range of Python errors can arise, each typically referring to code being run in restricted mode.
---------------------------------------------------->8------------

(Thanks to Lee Brown for asking about where lxml is called; it made me read the mod_python docs again.)

I'll try to set up my site on mod_python using "PythonInterpreter main_interpreter" in the config. According to the docs this might help...but if I read this right, it might produce namespace problems or at least pollute some global namespace. As this takes some time I will post the result later.

Perhaps it can be fixed in lxml by not using the "simplified API for access to the Global Interpreter Lock (GIL) for Python extension modules"?

Greetings, David Danier

Hi,

David Danier wrote:
> Somehow restricted mode is only mentioned in the docs for RExec (http://docs.python.org/lib/module-rexec.html), but should not be available any more, so I don't know what exactly lxml does to use callbacks.
> Found another place that mentions restricted mode by accident: http://www.modpython.org/live/current/doc-html/pyapi-interps.html
> I think this paragraph describes the problem pretty well:
> ------------8<----------------------------------------------------
> Note that if any third party module is being used which has a C code component that uses the simplified API for access to the Global Interpreter Lock (GIL) for Python extension modules, then the interpreter name must be forcibly set to be "main_interpreter". This is necessary as such a module will only work correctly if run within the context of the first Python interpreter created by the process. If not forced to run under the "main_interpreter", a range of Python errors can arise, each typically referring to code being run in restricted mode.
> ---------------------------------------------------->8------------
> (thanks to Lee Brown for asking about where lxml is called, it made me read the mod_python-docs again)

Thanks for the info, that's good to know.

> I'll try to setup my site on mod_python and using "PythonInterpreter main_interpreter" in the config. According to the docs this might help...but if I read this right might produce namespace-problems or at least pollute some global namespace. As this takes some time I will post the result later.

Please do.

> Perhaps it can be fixed in lxml by not using the "simplified API for access to the Global Interpreter Lock (GIL) for Python extension modules"?

No way. There's a reason why it is there, which is the same reason why we use it: it's simple and usable. Using anything else would mean a lot of rewriting.

You might want to try compiling lxml with "--without-threading", though, which disables concurrency support completely (i.e. no more GIL freeing).

Stefan

>> As this takes some time I will post the result later.
> Please do.

Seems to work properly. But I'm not really sure how badly "main_interpreter" is polluted now.

> No way. There's a reason why it is there which is the same why we use it: it's simple and usable. Using anything else would mean a lot of rewriting.

That's sad. What are the chances that patches addressing this problem would be accepted? (I must review the code first, but I would really like a clean solution here.)

> You might want to try compiling lxml with "--without-threading", though, which disables concurrency support completely (i.e. not more GIL freeing).

Works, too. But I'm not really sure if it is a good idea to do so, as Py_NewInterpreter seems to create a thread, see http://www.python.org/doc/current/api/initialization.html#l2h-820. But I think this might not be a problem if not using a threaded Apache MPM.

Greetings, David Danier

David Danier wrote:
>>> As this takes some time I will post the result later.
>> Please do.
> Seems to work properly. But I'm not really sure how bad "main_interpreter" is polluted now.

I wouldn't expect much (namespace) pollution, and I wouldn't worry about it unless there's real evidence that this can become a problem. And a crash is definitely a more important problem than namespace pollution.

>> No way. There's a reason why it is there which is the same why we use it: it's simple and usable. Using anything else would mean a lot of rewriting.
> Thats sad. What are the chances that patches addressing this problem are accepted? (Must review the code first, but I would really like a clean solution here)

We always accept patches as long as there is general interest and/or a good motivation behind them. But threading is pretty much an issue by itself in lxml.etree. And the "simplified API" gives you a way to just say "release GIL - call to libxml2 - acquire GIL" and "acquire GIL - run callback code - free GIL". That's as easy as it can get - especially since Cython has support for the latter nowadays. It is very unlikely that this can get any "cleaner" by changing the thread-lock calls.

>> You might want to try compiling lxml with "--without-threading", though, which disables concurrency support completely (i.e. not more GIL freeing).
> Works, too. But I'm not really sure it it is a good idea to do so, as Py_NewInterpreter seems to create a thread, see http://www.python.org/doc/current/api/initialization.html#l2h-820. But I think this might not be a problem if not using a threaded Apache-MPM.

What this option does is make lxml.etree stop freeing the GIL internally when calling into libxml2, which simply disables any concurrency, as it keeps the GIL until execution returns to Python code. In particular, the (simplified) thread API is no longer used, so there should no longer be any threading issues.

Stefan
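
The two lock patterns Stefan describes ("release GIL - call to libxml2 - acquire GIL" and "acquire GIL - run callback code - free GIL") can be pictured with the toy model below. This is only an analogy, not how lxml is implemented: the real calls happen in C/Cython, and here an ordinary threading.Lock named gil merely stands in for the interpreter lock. All function names are invented for the illustration.

```python
import threading

# Stand-in for the interpreter lock; this is NOT the real GIL.
gil = threading.Lock()
events = []


def callback():
    """Pattern 2: acquire the lock, run Python callback code, free the lock."""
    with gil:
        events.append("callback runs with the lock held")


def call_into_library():
    """Pattern 1: release the lock, run the C library, re-acquire the lock."""
    gil.release()
    try:
        events.append("library call runs without the lock")
        callback()  # the library calls back into Python code mid-way
    finally:
        gil.acquire()


gil.acquire()        # Python code normally runs with the lock held
call_into_library()
gil.release()

print(events)
```

In this picture, building with "--without-threading" corresponds to never executing the release step in call_into_library(), so the callback no longer needs to re-acquire anything.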
participants (4)
- David Danier
- Dmitri Fedoruk
- Lee Brown
- Stefan Behnel