[lxml-dev] python segfault by XSLT tostring?
hi all,

just tried lxml-0.8 and then the subversion lxml trunk, but both versions manage to segfault my python. libxml-2.6.16, libxslt-1.1.12, Pyrex-0.9.3, python-2.3.5, openbsd 3.8 is the configuration.

in short: funicode at src/lxml/etree.pyx:1841 can get called with null/None as argument, after which isutf8 segfaults on it.

this happens when i call tostring() on an lxml.etree.XSLT() object, with an empty document as argument (which was a result of a transformation). this is the code i ran:

    import sys
    import lxml.etree

    xsltfile = sys.argv[1]
    xmlfile = sys.argv[2]
    xsltdoc = lxml.etree.parse(open(xsltfile, 'r'))
    xslt = lxml.etree.XSLT(xsltdoc)
    xml = lxml.etree.parse(open(xmlfile, 'r'))
    result = xslt.apply(xml)
    print result
    print xslt.tostring(result)  # can segfault python if result contains "empty document"

this is the xslt file:

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="/">
    </xsl:template>
    </xsl:stylesheet>

and near-empty xml file:

    <?xml version="1.0"?>
    <blah></blah>

i admit that i know next to nothing about xslt and very little about xml (i was just playing around), but lxml should never make python segfault, whatever stupid thing i do.

my quick fix was to return an empty string at the start of funicode if the string is null. after this, it stopped segfaulting on this small example. good chance that breaks tostring() though.

best regards,
mechiel
Mechiel Lukkien wrote:
just tried lxml-0.8 and then the subversion lxml trunk, but both versions manage to segfault my python. libxml-2.6.16, libxslt-1.1.12, Pyrex-0.9.3, python-2.3.5, openbsd 3.8 is the configuration.
in short: funicode at src/lxml/etree.pyx:1841 can get called with null/None as argument, after which isutf8 segfaults on it.
this happens when i call tostring() on an lxml.etree.XSLT() object, with an empty document as argument (which was a result of a transformation). this is the code i ran:
    import sys
    import lxml.etree

    xsltfile = sys.argv[1]
    xmlfile = sys.argv[2]
    xsltdoc = lxml.etree.parse(open(xsltfile, 'r'))
    xslt = lxml.etree.XSLT(xsltdoc)
    xml = lxml.etree.parse(open(xmlfile, 'r'))
    result = xslt.apply(xml)
    print result
    print xslt.tostring(result)  # can segfault python if result contains "empty document"

this is the xslt file:

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="/">
    </xsl:template>
    </xsl:stylesheet>

and near-empty xml file:

    <?xml version="1.0"?>
    <blah></blah>
i admit that i know next to nothing about xslt and very little about xml (i was just playing around), but lxml should never make python segfault, whatever stupid thing i do.
my quick fix was to return an empty string at the start of funicode if the string is null. after this, it stopped segfaulting on this small example. good chance that breaks tostring() though.
Hi!

Thank you for the bug report. I can reproduce this both on the trunk and my branch using the test case below. It's simply modeled after your example.

I'll check if I can figure out something.

Stefan

Index: src/lxml/tests/test_etree.py
===================================================================
--- src/lxml/tests/test_etree.py	(revision 19669)
+++ src/lxml/tests/test_etree.py	(working copy)
@@ -2192,6 +2192,21 @@
         etree.tostring(result.getroot())
 
+    def test_xslt_empty(self):
+        # could segfault if result contains "empty document"
+        xml = '<blah/>'
+        xslt = '''
+        <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
+          <xsl:template match="/" />
+        </xsl:stylesheet>
+        '''
+
+        source = self.parse(xml)
+        styledoc = self.parse(xslt)
+        style = etree.XSLT(styledoc)
+        result = style.apply(source)
+        style.tostring(result)
+
     def test_xslt_shortcut(self):
         tree = self.parse('<a><b>B</b><c>C</c></a>')
         style = self.parse('''\
Stefan Behnel wrote:
Mechiel Lukkien wrote:
my quick fix was to return an empty string at the start of funicode if the string is null. after this, it stopped segfaulting on this small example. good chance that breaks tostring() though.
I don't think it does, I'd rather say that would be the right thing to do. I do believe, however, that it should be a bug somewhere else in etree if funicode() is *called* with NULL, so I won't fix it there. It should be caught in XSLT.tostring, which is where the error arises, since afterwards, we call xmlFree on the string (or on NULL resp.).

I've applied test case and fix to the trunk, revision 19670/19671.

Mechiel, since you've used SVN anyway, please update your version and retry.

Stefan
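
The design choice described above (guard in the caller, keep the low-level helper strict) can be sketched in plain Python. This is only an illustration under stated assumptions: NULL is modelled as None, and `strict_decode`, `serialize_result` and `xslt_tostring` are hypothetical stand-ins, not lxml's actual functions. The real fix is Pyrex code in revisions 19670/19671 and is not reproduced here.

```python
def strict_decode(buf):
    """Stand-in for a low-level helper such as funicode(): it expects a real buffer."""
    if buf is None:
        # In C this would be a NULL dereference; the sketch fails loudly instead.
        raise ValueError("strict_decode() called with NULL")
    return buf.decode("utf-8")


def serialize_result(nodes):
    """Hypothetical serializer: returns None (NULL) when there is nothing to write."""
    if not nodes:
        return None
    return "".join(nodes).encode("utf-8")


def xslt_tostring(nodes):
    """Guard at the call site, so the strict helper never sees NULL."""
    buf = serialize_result(nodes)
    if buf is None:
        return ""  # empty transformation result becomes an empty string
    return strict_decode(buf)
```

With this split, `xslt_tostring([])` yields an empty string, while a direct `strict_decode(None)` still fails, which keeps other NULL-passing bugs visible instead of masking them.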
On Wed, Nov 09, 2005 at 11:59:52AM +0100, Stefan Behnel wrote:
Stefan Behnel wrote:
Mechiel Lukkien wrote:
my quick fix was to return an empty string at the start of funicode if the string is null. after this, it stopped segfaulting on this small example. good chance that breaks tostring() though.
I don't think it does, I'd rather say that would be the right thing to do. I do believe, however, that it should be a bug somewhere else in etree if funicode() is *called* with NULL, so I won't fix it there. It should be caught in XSLT.tostring, which is where the error arises, since afterwards, we call xmlFree on the string (or on NULL resp.).
I've applied test case and fix to the trunk, revision 19670/19671.
Mechiel, since you've used SVN anyway, please update your version and retry.
i just updated to the latest version and tried. it seems to work fine.

one more remark: with a non-empty document, tostring() generates xml (or so it seems). with an empty document it generates an empty string. is that a valid xml document as well? at least there is no '<?xml version="1.0"?>' in the output. i tried lxml.etree.parse() on an empty file and on a file with just the xml version tag-like line. it raised etree.XMLSyntaxError for both. i'm not sure what tostring() exactly means to do though. if it's "generate valid xml" it might be better to raise some exception in this case.

thanks for the quick response,

best regards,
mechiel
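
The observation that neither an empty input nor a bare XML declaration parses can be reproduced with a small check. This sketch uses the standard library's xml.etree.ElementTree as a stand-in, so it raises `ParseError` rather than the `etree.XMLSyntaxError` reported above for lxml; the helper name `is_parseable` is made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_parseable(text):
    """Return True if text parses as an XML document with a root element."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


candidates = {
    "empty": "",
    "declaration only": '<?xml version="1.0"?>',
    "with root": '<?xml version="1.0"?><blah/>',
}

# Map each candidate name to whether the parser accepts it.
results = {name: is_parseable(text) for name, text in candidates.items()}
```

Only the "with root" candidate is accepted; a document without a root element is rejected by the parser, which is why an empty tostring() result cannot be fed back into parse().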
Mechiel Lukkien wrote:
On Wed, Nov 09, 2005 at 11:59:52AM +0100, Stefan Behnel wrote:
Stefan Behnel wrote:
Mechiel Lukkien wrote:
my quick fix was to return an empty string at the start of funicode if the string is null. after this, it stopped segfaulting on this small example. good chance that breaks tostring() though.
I don't think it does, I'd rather say that would be the right thing to do. I do believe, however, that it should be a bug somewhere else in etree if funicode() is *called* with NULL, so I won't fix it there. It should be caught in XSLT.tostring, which is where the error arises, since afterwards, we call xmlFree on the string (or on NULL resp.).
I've applied test case and fix to the trunk, revision 19670/19671.
Mechiel, since you've used SVN anyway, please update your version and retry.
i just updated to the latest version and tried. it seems to work fine.
one more remark: with a non-empty document, tostring() generates xml (or so it seems). with an empty document it generates an empty string. is that a valid xml document as well? at least there is no '<?xml version="1.0"?>' in the output. i tried lxml.etree.parse() on an empty file and on a file with just the xml version tag-like line. it raised etree.XMLSyntaxError for both. i'm not sure what tostring() exactly means to do though. if it's "generate valid xml" it might be better to raise some exception in this case.
Good question. I thought about that when I wrote the patch, but I considered an empty result of an XSLT to mean that there are no nodes in the result, specifically that there is no root node. It is not an uncommon case for an XSLT to return nothing, so that by itself should not raise an exception. And when you convert that into a string - why should that raise an exception? You could rather consider returning None, but I think an exception would be too much here. In any case, there is no XML representation of an empty document without root node.

I think returning an empty string fits both the semantics of an empty result and of having asked for a string. If the user does not expect an empty result, that's her fault (it was her stylesheet after all), so it's up to the user anyway to check the result.

So, IMHO, it's either None or the empty string. Any arguments for using None?

Stefan
Hi, On Wed, 2005-11-09 at 12:49 +0100, Stefan Behnel wrote:
Mechiel Lukkien wrote:
On Wed, Nov 09, 2005 at 11:59:52AM +0100, Stefan Behnel wrote:
Stefan Behnel wrote:
Mechiel Lukkien wrote:
my quick fix was to return an empty string at the start of funicode if the string is null. after this, it stopped segfaulting on this small example. good chance that breaks tostring() though.
I don't think it does, I'd rather say that would be the right thing to do. I do believe, however, that it should be a bug somewhere else in etree if funicode() is *called* with NULL, so I won't fix it there. It should be caught in XSLT.tostring, which is where the error arises, since afterwards, we call xmlFree on the string (or on NULL resp.).
I've applied test case and fix to the trunk, revision 19670/19671.
Mechiel, since you've used SVN anyway, please update your version and retry.
i just updated to the latest version and tried. it seems to work fine.
one more remark: with a non-empty document, tostring() generates xml (or so it seems). with an empty document it generates an empty string. is that a valid xml document as well? at least there is no '<?xml version="1.0"?>' in the output. i tried lxml.etree.parse() on an empty file and on a file with just the xml version tag-like line. it raised etree.XMLSyntaxError for both. i'm not sure what tostring() exactly means to do though. if it's "generate valid xml" it might be better to raise some exception in this case.
Maybe the definition of the DOM LSSerializer could be of some help to design the behaviour: http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSSerializer At least this holds for an XSLT output method of "xml"; I guess for "text" this should look different.
Good question.
I thought about that when I wrote the patch, but I considered an empty result of an XSLT to mean that there are no nodes in the result, specifically that there is no root node. It is not an uncommon case for an XSLT to return nothing, so that by itself should not raise an exception. And when you convert that into a string - why should that raise an exception? You could rather consider returning None, but I think an exception would be too much here. In any case, there is no XML representation of an empty document without root node.
You probably mean "no *well-formed* XML document" here. IMHO I wouldn't restrict the result of tostring() to produce well-formed XML only, since it might be handy for lexical copy&paste stuff.
From the LSSerializer: "Note: The serialization of a Node does not always generate a well-formed XML document, i.e. a LSParser might throw fatal errors when parsing the resulting serialization." But this is the DOM viewpoint, so I dunno if it fits into lxml.etree's semantics.
I think returning an empty string fits both the semantics of an empty result and of having asked for a string. If the user does not expect an empty result, that's her fault (it was her stylesheet after all), so it's up to the user anyway to check the result.
So, IMHO, it's either None or the empty string. Any arguments for using None?
Stefan
Regards, Kasimier
Kasimier Buchcik wrote:
On Wed, 2005-11-09 at 12:49 +0100, Stefan Behnel wrote:
Mechiel Lukkien wrote:
one more remark: with a non-empty document, tostring() generates xml (or so it seems). with an empty document it generates an empty string. is that a valid xml document as well? at least there is no '<?xml version="1.0"?>' in the output. i tried lxml.etree.parse() on an empty file and on a file with just the xml version tag-like line. it raised etree.XMLSyntaxError for both. i'm not sure what tostring() exactly means to do though. if it's "generate valid xml" it might be better to raise some exception in this case.
Maybe the definition of the DOM LSSerializer could be of some help to design the behaviour: http://www.w3.org/TR/DOM-Level-3-LS/load-save.html#LS-LSSerializer
At least this holds for an XSLT output method of "xml"; I guess for "text" this should look different.
Sure, there we go. We can't actually know what should have been generated, so we specifically can't rely on it being XML. This gets me pretty convinced that returning an empty string is perfectly right. The user has to check the result anyway, and an empty string usually is easier to handle than a possible None value that gives you no additional semantic information.
there is no XML representation of an empty document without root node.
You probably mean "no *well-formed* XML document" here. IMHO I wouldn't restrict the result of tostring() to produce well-formed XML only, since it might be handy for lexical copy&paste stuff.
Sure, XSLT is handy for loads of things. That's why the XSLT class has its own tostring method: it knows best what was intended as result type.
From the LSSerializer: "Note: The serialization of a Node does not always generate a well-formed XML document, i.e. a LSParser might throw fatal errors when parsing the resulting serialization."
But this is the DOM viewpoint, so I dunno if it fits into lxml.etree's semantics.
At least it says 'parser', so it's not up to the serializer to complain. :)
I think returning an empty string fits both the semantics of an empty result and of having asked for a string. If the user does not expect an empty result, that's her fault (it was her stylesheet after all), so it's up to the user anyway to check the result.
Quoting myself, but I'm even more convinced now. Stefan
Stefan Behnel wrote: [snip]
I think returning an empty string fits both the semantics of an empty result and of having asked for a string. If the user does not expect an empty result, that's her fault (it was her stylesheet after all), so it's up to the user anyway to check the result.
So, IMHO, it's either None or the empty string. Any arguments for using None?
I think the empty string is fine. XSLT is not guaranteed to deliver parseable XML at all, after all. It could produce any text, including the empty text. Regards, Martijn
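
The argument that an empty string is easier to handle than None can be illustrated with a short sketch. The `publish` function is a hypothetical consumer of a tostring() result, invented for this example; it is not part of lxml.

```python
def publish(serialized):
    """Hypothetical consumer: wraps a serialized result in a minimal header."""
    # len() and string formatting work on any string, including the empty one.
    return "Content-Length: %d\n\n%s" % (len(serialized), serialized)


# An empty transformation result needs no special casing.
empty_message = publish("")

# A None result would need an explicit branch; without one, len(None) fails.
try:
    publish(None)
    none_failed = False
except TypeError:
    none_failed = True
```

Callers that do care about an empty result can still test for it with a plain truth check on the string.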
participants (4)
- Kasimier Buchcik
- Martijn Faassen
- Mechiel Lukkien
- Stefan Behnel