[lxml-dev] exslt:regexp implementation based on 're'

Hi,

I noticed that exslt:regexp was not supported by libexslt, so I wrote three extension functions that use Python's re module (which is not really JavaScript compatible as requested by the spec, but who cares...). Here's an example:

----------------------------------------
from lxml import etree

xslt = etree.XSLT(etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:regexp="http://exslt.org/regular-expressions"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*">
    <test><xsl:copy-of select="*[regexp:test(string(.), '8.')]"/></test>
  </xsl:template>
</xsl:stylesheet>
"""))

result = xslt(etree.XML('<a><b>123</b><b>098</b><b>987</b></a>'))
print(str(result))

<test><b>987</b></test>
Since the test cases worked out perfectly, it's already in the trunk. So, when the regular exslt support gets merged, lxml will have more complete exslt support than libxslt itself. :)

Stefan

Hey,

Stefan Behnel wrote:
> I noticed that exslt:regexp was not supported by libexslt, so I wrote three extension functions that use Python's re module (which is not really JavaScript compatible as requested by the spec, but who cares...).

I think one might care if one had a stylesheet that uses exslt and it then fails to work with lxml because the regex behavior is different?
> Since the test cases worked out perfectly, it's already in the trunk. So, when the regular exslt support gets merged, lxml will have more complete exslt support than libxslt itself. :)

Cool. :)

One thing that I wonder about is potential security issues? Are there ways to break out of the Python regexes and call arbitrary Python code? If not, then we don't need to worry about it. XSLT can be run from fairly unsafe sources so this may be a concern.

Regards,

Martijn

Hi Martijn,

Martijn Faassen wrote:
> Stefan Behnel wrote:
> > I noticed that exslt:regexp was not supported by libexslt, so I wrote three extension functions that use Python's re module (which is not really JavaScript compatible as requested by the spec, but who cares...).
> I think one might care if one had a stylesheet that uses exslt and it then fails to work with lxml because the regex behavior is different?
The API is identical, it just depends on what sort of expressions you use. The normal ().*+ stuff should be the same, also \w and the like. But you'll never find two RE implementations that are completely compatible. So, well, you'll just have to take care if you want to write portable stylesheets. Note that many processors do not even support REs at all and different processors base their support on different libraries (JavaScript or Apache or whatever).
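As a rough illustration of the portability point above, here is a minimal sketch that checks whether a pattern compiles under Python's re module. The helper name `compiles_in_python` is made up for this example, and the check says nothing about how a JavaScript-based processor would treat the same pattern. The last pattern uses the named-group spelling from newer JavaScript versions, which Python's re does not accept.

```python
import re

def compiles_in_python(pattern):
    """Hypothetical helper: True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Common constructs such as groups, \w, \d, * and + are accepted.
basic_ok = compiles_in_python(r"(\w+)-(\d*)")

# Python-specific named group syntax.
python_named_ok = compiles_in_python(r"(?P<year>\d{4})")

# JavaScript-style named group syntax, rejected by Python's re.
js_named_ok = compiles_in_python(r"(?<year>\d{4})")

print(basic_ok, python_named_ok, js_named_ok)
```

A stylesheet that sticks to the first kind of pattern has the best chance of behaving the same across processors.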
> > Since the test cases worked out perfectly, it's already in the trunk. So, when the regular exslt support gets merged, lxml will have more complete exslt support than libxslt itself. :)
> Cool. :)
>
> One thing that I wonder about is potential security issues? Are there ways to break out of the Python regexes and call arbitrary Python code? If not, then we don't need to worry about it. XSLT can be run from fairly unsafe sources so this may be a concern.

I wouldn't know why there should be any risks. The regexps are just handed to the re.compile function as is and there shouldn't be any way to break out of the (s)re module. There are no calls to "eval" or anything like it. The EXSLT extensions shouldn't do any harm either.

On the other hand, registering the libxslt "extra" extension functions may be a risk. There is a "debug" element that becomes accessible, and the "output" and "write" elements can write(!) to files. So, maybe we should require some initialization function call to add those extras. I'll just remove the "extra" registration for now.

Also, remember that the document() function can be used to access local XML files. That may already be a risk in some cases.

Stefan
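The "handed to re.compile as is" description could look roughly like the sketch below. This is not the code that went into the trunk: the function name `regexp_test`, its signature, and the handling of the "i" flag are assumptions made for illustration, using only the standard library. The pattern is treated purely as data, with no eval involved.

```python
import re

def regexp_test(context, subject, pattern, flags=""):
    """Hypothetical stand-in for a regexp:test() extension function.

    The pattern string is passed straight to re.compile; nothing in it
    is ever executed as Python code.
    """
    re_flags = re.IGNORECASE if "i" in flags else 0
    return re.compile(pattern, re_flags).search(subject) is not None

# Same values and pattern as the stylesheet example in this thread.
values = ["123", "098", "987"]
matching = [v for v in values if regexp_test(None, v, "8.")]
print(matching)  # ['987']
```

Only "987" matches, because "8." needs a character after the 8, which "098" does not have.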

Martijn Faassen wrote:
> Stefan Behnel wrote:
> [snip]
> > Also, remember that the document() function can be used to access local XML files. That may already be a risk in some cases.
> Good point. The custom resolver story could help against that, right?

Right. As long as you return anything but None from the Python resolvers, it will be parsed and handed directly back to libxslt. So, if you want to keep libxslt from doing any access to network or hard-disk, it "should" (untested) be enough to write a dummy resolver that returns a dummy or the empty document (resolve_empty()).

Stefan

Martijn Faassen wrote:
> One thing that I wonder about is potential security issues? Are there ways to break out of the Python regexes and call arbitrary Python code? If not, then we don't need to worry about it. XSLT can be run from fairly unsafe sources so this may be a concern.

you can "hang" RE if you want (by crafting a really lousy RE that causes excessive backtracking), but since you can "hang" any XML parser that supports internal DTD:s (google for the "billion laughs attack"), I'm not sure how serious this is.

I wouldn't accept XSLT programs from untrusted sources, though...

</F>
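To make the backtracking remark concrete, here is a small sketch using only the standard library. The pattern `(a+)+$` is a common textbook example of nested quantifiers, not something taken from this thread. The input sizes are kept small on purpose so the script finishes quickly; the time for a failed match grows steeply with each extra character.

```python
import re
import time

# Nested quantifiers: many ways to split a run of "a" between the groups.
LOUSY = re.compile(r"(a+)+$")

def time_failed_match(n):
    """Time a match attempt that must fail because of the trailing 'b'."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = LOUSY.match(subject)
    return result, time.perf_counter() - start

results = []
for n in (10, 14, 18):
    result, elapsed = time_failed_match(n)
    results.append(result)
    print("n=%d matched=%r elapsed=%.4fs" % (n, result is not None, elapsed))
```

A pattern like this arriving inside a stylesheet would keep the transformation busy in the same way, which is the "hang" described above.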

Fredrik Lundh wrote:
> Martijn Faassen wrote:
> > One thing that I wonder about is potential security issues? Are there ways to break out of the Python regexes and call arbitrary Python code? If not, then we don't need to worry about it. XSLT can be run from fairly unsafe sources so this may be a concern.
> you can "hang" RE if you want (by crafting a really lousy RE that causes excessive backtracking), but since you can "hang" any XML parser that supports internal DTD:s (google for the "billion laughs attack"), I'm not sure how serious this is.
>
> I wouldn't accept XSLT programs from untrusted sources, though...

Sure, that's the main threat. XSLT is Turing-complete. Anyone can write an infinitely recursing stylesheet - and, in general, no machine can decide whether a given stylesheet will terminate...

Stefan

Fredrik Lundh wrote:
> Martijn Faassen wrote:
> > One thing that I wonder about is potential security issues? Are there ways to break out of the Python regexes and call arbitrary Python code? If not, then we don't need to worry about it. XSLT can be run from fairly unsafe sources so this may be a concern.
> you can "hang" RE if you want (by crafting a really lousy RE that causes excessive backtracking), but since you can "hang" any XML parser that supports internal DTD:s (google for the "billion laughs attack"), I'm not sure how serious this is.
>
> I wouldn't accept XSLT programs from untrusted sources, though...

Agreed that accepting any programs from untrusted sources is dangerous, but it also depends a bit on exactly how untrusted your sources are. I just wanted to make sure we didn't get some kind of privilege escalation where people could trigger Python code from XSLT through cleverly crafted regexes that use some Python-specific extension I don't know about. Apparently this is safe.

Regards,

Martijn
participants (3)
- Fredrik Lundh
- Martijn Faassen
- Stefan Behnel