[lxml-dev] document('') fixed
Hi,
I played with the XSLT document loaders and found that the default loader can apparently handle "document('')" on XSL documents read from strings as long as they have a non-empty URL. This only makes sense when you know that libxslt keeps a list of known documents during the transformation, so it apparently searches that list for the URL of the requested document.
I changed the code on the trunk to create a fake URL for the case that the document URL is empty. So, document('') should now work from any stylesheet (if anyone wants to verify...)
Stefan
Hi,
--- Original Message --- From: Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> To: ML-Lxml-dev <lxml-dev@codespeak.net> Subject: [lxml-dev] document('') fixed Date: Fri, 21 Apr 2006 13:17:05 +0200
Hi,
I played with the XSLT document loaders and found that the default loader can apparently handle "document('')" on XSL documents read from strings as long as they have a non-empty URL. This only makes sense when you know that libxslt keeps a list of known documents during the transformation, so it apparently searches that list for the URL of the requested document.
I changed the code on the trunk to create a fake URL for the case that the document URL is empty. So, document('') should now work from any stylesheet (if anyone wants to verify...)
Does it still reparse the stylesheet document? If you managed to reuse the stylesheet tree for this purpose, then this will produce problems, since the stylesheet compilation process of Libxslt changes the tree; for example, it will eliminate xsl:text elements and preserve whitespace-only text nodes if they are children of xsl:text.
Regards,
Kasimier
cazic@gmx.net wrote:
Stefan Behnel wrote:
I changed the code on the trunk to create a fake URL for the case that the document URL is empty. So, document('') should now work from any stylesheet (if anyone wants to verify...)
Does it still reparse the stylesheet document?
I'm not changing anything here, I'm only providing a URL for the stylesheet, which already exists for stylesheets read from files or the network. I tried this:
----------------------
>>> from lxml.etree import XSLT, XML
>>> xml = XML("""\
... <stylesheet xmlns="http://www.w3.org/1999/XSL/Transform">
... <template match="/">
... <copy-of select="document('')/*/*"/>
... </template>
... </stylesheet>""")
>>> xslt = XSLT(xml)
>>> str(xslt(xml))
'<?xml version="1.0"?>\n<template xmlns="http://www.w3.org/1999/XSL/Transform" match="/"><copy-of select="document(\'\')/*/*"/></template>\n'
----------------------
The output is all in one line. strace tells me that it tries to find the fake file and fails. It then checks a catalog in /etc/xml and then retries finding the fake file (which fails again). However, it then returns the above tree, so there must be a fallback somewhere that lets document('') succeed.
If you managed to reuse the stylesheet tree for this purpose, then this will produce problems, since the stylesheet compilation process of Libxslt changes the tree; for example, it will eliminate xsl:text elements and preserve whitespace-only text nodes if they are children of xsl:text.
That would produce the above output, yes. So, what you say is that we should rather handle the lookup "manually"? That would require copying the document twice before the XSLT compilation, to use one copy for compilation and to store the other one. The doc loader would then return a copy of the second copy when the stylesheet URL is requested. Is that the correct approach? That would really make it a lot of deep copying. If this is really necessary, would you mind if I called this behaviour a bug in libxslt? Stefan
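The double-copy idea described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml/libxslt, and fake_compile and StylesheetHolder are hypothetical names. fake_compile merely imitates a compilation step that modifies the tree in place by dropping xsl:text elements.

```python
import copy
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
XSL_TEXT = f"{{{XSL_NS}}}text"


def fake_compile(root):
    """Hypothetical stand-in for compilation: removes xsl:text elements in place."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == XSL_TEXT:
                parent.remove(child)
    return root


class StylesheetHolder:
    """Hypothetical holder that keeps one untouched copy next to the compiled one."""

    def __init__(self, root):
        # First copy: stored untouched, to be served for document('').
        self._pristine = copy.deepcopy(root)
        # Second copy: handed to the (destructive) compilation step.
        self.compiled = fake_compile(copy.deepcopy(root))

    def load_self(self):
        # Each request gets its own copy, so callers cannot damage the stored one.
        return copy.deepcopy(self._pristine)


source = ET.fromstring(
    f'<xsl:stylesheet xmlns:xsl="{XSL_NS}">'
    '<xsl:template match="/"><xsl:text> </xsl:text></xsl:template>'
    '</xsl:stylesheet>'
)
holder = StylesheetHolder(source)
```

The cost is visible here: two deep copies up front, plus one more for every lookup of the stylesheet's own URL.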
Hi,
--- Original Message --- From: Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> To: cazic@gmx.net Cc: lxml-dev@codespeak.net Subject: Re: [lxml-dev] document('') fixed Date: Fri, 21 Apr 2006 14:39:53 +0200
[...]
rather handle the lookup "manually"? That would require copying the document twice before the XSLT compilation, to use one copy for compilation and to store the other one. The doc loader would then return a copy of the second copy when the stylesheet URL is requested.
Is that the correct approach? That would really make it a lot of deep copying. If this is really necessary, would you mind if I called this behaviour a bug in libxslt?
For whitespace-stripping see: http://www.w3.org/TR/xslt#strip or the XSLT 2.0 spec, which clarifies the intended behaviour much better: http://www.w3.org/TR/xslt20/#stylesheet-stripping
The elimination of xsl:text elements is a Libxslt-only thingy, but it's just internal processing, like the pre-compilation of XPath expressions.
I learned that the spec of XSLT 2.0 clarifies the semantics of the document() function (which, as I was told, was introduced in an abandoned draft of XSLT 1.1 and never made it into the recommendation):
"One effect of these rules is that unless XML entities or xml:base are used, and provided that the base URI of the stylesheet module is known, document("") refers to the document node of the containing stylesheet module (the definitive rules are in [RFC3986]). The XML resource containing the stylesheet module is processed exactly as if it were any other XML document, for example there is no special recognition of xsl:text elements, and no special treatment of comments and processing instructions." (http://www.w3.org/TR/xslt20/#document)
So this mechanism relies on a known base URI, and no base URI is known if the stylesheet tree is constructed from an in-memory string. I haven't read RFC3986, but an interesting question for me is whether the *string* containing the XML could be treated as the document and be addressed/acquired via the document("") function. So if you could tweak lxml to keep a reference to that string, and feed Libxslt with it when document("") is called, that would be a nice solution, I think.
Regards,
Kasimier
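To illustrate why the base URI matters, here is a small sketch using Python's standard urllib.parse.urljoin, which follows the RFC 3986 resolution rules for common cases. It is not what libxslt does internally; resolve_document_uri is a hypothetical helper and the example.org URLs are made up.

```python
from urllib.parse import urljoin


def resolve_document_uri(base_uri: str, href: str) -> str:
    """Resolve the argument of document() against the stylesheet's base URI."""
    return urljoin(base_uri, href)


# With a known base URI, the empty reference points back at the stylesheet itself.
print(resolve_document_uri("http://example.org/xsl/style.xsl", ""))
# → http://example.org/xsl/style.xsl

# Relative references are resolved next to the stylesheet.
print(resolve_document_uri("http://example.org/xsl/style.xsl", "data.xml"))
# → http://example.org/xsl/data.xml

# Without a base URI (stylesheet parsed from a string) nothing usable comes out.
print(repr(resolve_document_uri("", "")))
# → ''
```

The last case is the one a generated fake URL works around: it gives the empty reference something to resolve to.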
cazic@gmx.net wrote:
For whitespace-stripping see: http://www.w3.org/TR/xslt#strip
or the XSLT 2.0 spec, which clarifies the intended behaviour much better: http://www.w3.org/TR/xslt20/#stylesheet-stripping
The elimination of xsl:text elements is a Libxslt-only thingy, but it's just an internal processing like pre-compilation of XPath expressions. [snip] So this mechanism relies on a base URI to be known, which is not known if the stylesheet-tree is constructed from an in-memory string.
Ok, I understand that there are certain minor changes in the stylesheet structure, mainly for whitespace nodes and xsl:text elements. I personally don't think this is worth storing XML data and copying documents all over the place. Since most people will use document() only to a) find documents in the same directory as the stylesheet (which works anyway) or b) access data in the stylesheet (as opposed to templates, etc.), I can't see why it should hurt anyone to just leave it as it is now. Even the whitespace stripping stuff will presumably only show surprising results in very rare cases.
So, my preferred solution is to just let document('') access the stylesheet and "maybe" collect some possible surprising effects somewhere in the documentation. Everything else would be too much overhead in the average case (and for the programmer :).
Stefan
Hi,
--- Original Message --- From: Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> To: cazic@gmx.net Cc: lxml-dev@codespeak.net Subject: Re: [lxml-dev] document('') fixed Date: Fri, 21 Apr 2006 17:00:58 +0200
[...]
anyone to just leave it as it is now. Even the white-space stripping stuff will presumably only show surprising results in very rare cases.
So, my preferred solution is to just let document('') access the stylesheet and "maybe" collect some possible surprising effects somewhere in the documentation. Everything else would be too much overhead in the average case (and for the programmer :).
Well, we could strip processing-instructions by default in the Libxml2-parser; I don't use them and I think they are rarely used out there anyway ;-) Regards, Kasimier
cazic@gmx.net wrote:
Stefan Behnel wrote:
So, my preferred solution is to just let document('') access the stylesheet and "maybe" collect some possible surprising effects somewhere in the documentation. Everything else would be too much overhead in the average case (and for the programmer :).
Well, we could strip processing-instructions by default in the Libxml2-parser; I don't use them and I think they are rarely used out there anyway ;-)
True. But don't forget to document that somewhere in the source code! ;) Stefan
Hi, previously, I wrote:
rather handle the lookup "manually"? That would require copying the document twice before the XSLT compilation, to use one copy for compilation and to store the other one. The doc loader would then return a copy of the second copy when the stylesheet URL is requested.
I revised my previous opinion on this. The current code now uses exactly this approach. Storing the string or a filename reference would not have solved the problem, as there is nothing that keeps a user from building stylesheets by hand using the API.
Alternatively, we could serialize the XSL to a string before compiling it and parse it on request. Daniel suggested that this might even be faster than deep-copying. I wouldn't mind hearing other opinions on this.
Anyway, this is how it works now. Stylesheets that were parsed from strings are now special-cased and a fake URI is generated for them. The lookup works as follows (first match wins):
1) if the requested URI is a fake URI
   a) the default resolver is asked to find the document
   b) the URI is checked against the current XSL document
2) the Python resolvers are called
3) the default resolver is called
4) fail
This allows document('') to work in all cases (fingers crossed) and prefers the Python resolvers for anything but string-loaded stylesheets. I think that's a good trade-off. Doctests and explanations can be found in doc/resolvers.txt.
Remember: The more feedback I get, the faster the branch can be merged into the trunk. If anyone can come up with additional doctests, clarifications or unit test cases, that would be much appreciated.
Stefan
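The lookup order above can be modelled in a few lines of plain Python. This is a sketch of the described order only, not lxml's actual code: resolve_document, the Resolver type and the FAKE_URI_PREFIX marker are all hypothetical, documents are represented as plain strings, and it assumes that a fake URI which matches neither 1a nor 1b falls through to steps 2 to 4.

```python
from typing import Callable, Optional, Sequence

# Hypothetical names throughout; a resolver returns a document or None.
Resolver = Callable[[str], Optional[str]]
FAKE_URI_PREFIX = "fake-stylesheet://"  # assumed marker for string-parsed stylesheets


def resolve_document(uri: str, stylesheet_uri: str, stylesheet_doc: str,
                     python_resolvers: Sequence[Resolver],
                     default_resolver: Resolver) -> str:
    """Return the first document found for `uri` (first match wins)."""
    if uri.startswith(FAKE_URI_PREFIX):
        found = default_resolver(uri)          # 1a) ask the default resolver
        if found is not None:
            return found
        if uri == stylesheet_uri:              # 1b) is it the current stylesheet?
            return stylesheet_doc
    for resolver in python_resolvers:          # 2) Python resolvers, in order
        found = resolver(uri)
        if found is not None:
            return found
    found = default_resolver(uri)              # 3) default resolver
    if found is not None:
        return found
    raise LookupError(f"cannot resolve {uri!r}")  # 4) fail
```

For ordinary URIs the Python resolvers are consulted before the default resolver; only the fake URIs of string-loaded stylesheets take the short cut in step 1.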
Hi, one more comment on this:
Alternatively, we could serialize the XSL to a string before compiling it and parse it on request. Daniel suggested that this might even be faster than deep-copying.
I did a couple of tests and found that this is much slower for small stylesheets. It may also carry the additional risk of requiring special parser options that may not be known to XSLT.
I have now simplified the code somewhat and special-cased only the current stylesheet itself. The rest is handed to the Python loaders and subsequently to the default loader. If there are no substantial counter-arguments to this behaviour, I'll just wait for a few bug reports and otherwise merge it next week.
Stefan
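A comparison of this kind could be set up roughly like this. Note the assumptions: the sketch uses the standard library's xml.etree.ElementTree rather than lxml/libxml2, so its timings say nothing about the numbers measured above; the stylesheet and the function names are made up for illustration.

```python
import copy
import timeit
import xml.etree.ElementTree as ET

# A deliberately small stylesheet (illustrative only).
STYLESHEET = (
    '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<xsl:template match="/"><xsl:copy-of select="document(\'\')/*/*"/></xsl:template>'
    '</xsl:stylesheet>'
)

root = ET.fromstring(STYLESHEET)
serialized = ET.tostring(root)  # serialized once, before "compilation"


def snapshot_by_copy():
    """Deep-copy the in-memory tree on each request."""
    return copy.deepcopy(root)


def snapshot_by_reparse():
    """Re-parse the stored serialization on each request."""
    return ET.fromstring(serialized)


if __name__ == "__main__":
    for fn in (snapshot_by_copy, snapshot_by_reparse):
        seconds = timeit.timeit(fn, number=2000)
        print(f"{fn.__name__}: {seconds:.4f}s for 2000 runs")
```

Both functions return an independent tree with the same content, so the choice between them is purely a matter of cost and of which parser options the re-parse would need.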
participants (2)
- cazic@gmx.net
- Stefan Behnel