[lxml-dev] xinclude bug?
I'm working on a project that will use lxml's xinclude functionality to insert the contents of python files into an xml document and have noticed a possible bug. When you xinclude with the parse attribute set to "text", the text frequently (though not always) gets loaded into multiple adjacent text nodes, so that if you access the text attribute of a containing element, you only get part of the actual text. You can verify this by calling element.xpath('text()') on the container... you get back a list with multiple elements. Is this how things are supposed to work?

Also, escaping seems to occur in strings accessed from the "text" attribute of xincluded content, but not in strings retrieved via xpath, as described above. Is there a reliable way to reverse the escaping process, so that the original contents of the xincluded file can be retrieved? I assume that xml.sax.saxutils.unescape() would work, but don't know for sure.

I've pasted some example code below to demonstrate the seemingly broken "text" attribute use and the different escaping styles.

Thanks,
Greg

doc.xml
-----------------------------
<?xml version="1.0" encoding="UTF-8"?>
<doc xmlns:xi="http://www.w3.org/2001/XInclude">
 <xi:include href="doc.py" parse="text"/>
</doc>

doc.py
---------------------------
#!/usr/bin/python

s1 = '3 < 4'
s2 = "hello;"

test.py
--------------------------
from lxml import etree

tree = etree.parse('doc.xml')
tree.xinclude()
root = tree.getroot()

print(repr(root.text))
print('----')
print(root.xpath('text()'))
Hi Greg,

thanks for reporting this.

Greg Steffensen wrote:
When you xinclude with the parse attribute set to "text", the text frequently (though not always) gets loaded into multiple adjacent text nodes, so that if you access the text attribute of a containing element, you only get part of the actual text. You can verify this by calling element.xpath('text()') on the container... you get back a list with multiple elements. Is this how things are supposed to work?
That's not quite what your example below shows. You take this document:
<?xml version="1.0" encoding="UTF-8"?>
<doc xmlns:xi="http://www.w3.org/2001/XInclude">
 <xi:include href="doc.py" parse="text"/>
</doc>
Note the whitespace around the include element. What I get for the XPath call after running the include is:

['\n ', '#!/usr/bin/python\n\ns1 = \'3 < 4\'\ns2 = "hello;"\n', '\n']

So the new text was correctly added between the two existing text nodes. Now, what happens internally is that libxml2 adds special xinclude nodes around the included part as a kind of marker. So, when we collect text nodes for the ".text" property, we stop at the xinclude nodes and only regard the text before them. This results in what you see for the text property:

'\n '

I consider this a bug in lxml. I think we should step over xinclude nodes when collecting text content.
Also, escaping seems to occur in strings accessed from the "text" attribute of xincluded content, but not in strings retrieved via xpath,
I'm not quite sure what you mean here. Could you give an example? I mean, the above XPath result has normal text content, not escaped in any way.

Stefan
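To make the escaping question concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the example runs without lxml installed), and it takes the two assignment lines of doc.py as sample text. It shows that text held in the tree is unescaped, that escaping only appears on serialization, and that xml.sax.saxutils.unescape() reverses the default escaping of &, < and >.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Sample text taken from doc.py in the original post.
source = "s1 = '3 < 4'\ns2 = \"hello;\"\n"

# Text stored on an element is kept as a plain, unescaped string.
doc = ET.Element("doc")
doc.text = source
in_memory = doc.text

# Escaping only happens when the tree is serialized back to XML.
serialized = ET.tostring(doc, encoding="unicode")

# escape()/unescape() handle &, < and > by default, so they round-trip.
round_trip = unescape(escape(source))

# Parsing the serialized form again yields the original text.
reparsed = ET.fromstring(serialized).text

print(repr(in_memory))
print(serialized)
```

If the strings read from the tree already look like the original file, no unescaping step should be needed. Applying unescape() to text that was never escaped would corrupt any literal "&lt;" sequences in the source.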
Hi again,

Stefan Behnel wrote:
<?xml version="1.0" encoding="UTF-8"?>
<doc xmlns:xi="http://www.w3.org/2001/XInclude">
 <xi:include href="doc.py" parse="text"/>
</doc>
Note the whitespace around the include element. What I get for the XPath call after running the include is:
['\n ', '#!/usr/bin/python\n\ns1 = \'3 < 4\'\ns2 = "hello;"\n', '\n']
So the new text was correctly added between the two existing text nodes. Now, what happens internally is that libxml2 adds special xinclude nodes around the included part as a kind of marker. So, when we collect text nodes for the ".text" property, we stop at the xinclude nodes and only regard the text before them. This results in what you see for the text property:
'\n '
I consider this a bug in lxml. I think we should step over xinclude nodes when collecting text content.
I fixed this. When collecting text nodes, we now step over xinclude nodes and continue.

Stefan
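The difference between the old and the new behaviour can be sketched in plain Python. This is only an illustration, not lxml's actual implementation: the (kind, value) tuples, the constant names, and the collect_text() helper are all hypothetical stand-ins for the sibling nodes that libxml2 produces around included content.

```python
# Hypothetical node kinds standing in for libxml2 node types.
TEXT = "text"
XINCLUDE_START = "xinclude_start"
XINCLUDE_END = "xinclude_end"


def collect_text(nodes, skip_xinclude):
    """Join leading text nodes, optionally stepping over xinclude markers."""
    parts = []
    for kind, value in nodes:
        if kind == TEXT:
            parts.append(value)
        elif skip_xinclude and kind in (XINCLUDE_START, XINCLUDE_END):
            # New behaviour: ignore the marker and keep collecting.
            continue
        else:
            # Any other node (or a marker, in the old behaviour) ends the run.
            break
    return "".join(parts)


# Children of <doc> after the include, as described in the thread.
included = '#!/usr/bin/python\n\ns1 = \'3 < 4\'\ns2 = "hello;"\n'
children = [
    (TEXT, "\n "),
    (XINCLUDE_START, None),
    (TEXT, included),
    (XINCLUDE_END, None),
    (TEXT, "\n"),
]

before_fix = collect_text(children, skip_xinclude=False)
after_fix = collect_text(children, skip_xinclude=True)

print(repr(before_fix))
print(repr(after_fix))
```

With skip_xinclude=False the helper returns only the leading whitespace, which matches the truncated ".text" value reported above. With skip_xinclude=True it returns the whitespace, the included file contents, and the trailing newline as one string.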
participants (2)
- Greg Steffensen
- Stefan Behnel