[lxml-dev] Stylesheet Processing Instruction

Hi there,
Someone asked me if lxml would handle a 'Stylesheet Processing Instruction', which seems to be the way to embed the stylesheet into the XML to be transformed, i.e., if you use the said instruction and open the XML in the browser (IE and Firefox?), the browser automatically applies the transform. Since the 'xsltproc' command also seems to do this, according to its man page, I expected lxml to do so as well, but didn't actually try. So, can anyone confirm/deny whether it's supported?
-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Hi Sidnei, Sidnei da Silva wrote:
You can't always infer the features of lxml from the command line tool xsltproc (and vice versa) as both serve different purposes. That said, I tried to pass a stylesheet to xsltproc using a PI and couldn't manage to start it without passing a separate stylesheet document, so I currently assume the man page to be wrong in that regard. Might be a good idea to ask the same question on the libxslt mailing list. lxml does not currently support this, first point being that we don't have an API for running an XSLT without a stylesheet document. And after a quick glance through the libxslt API, I currently do not see a trivial way to implement that. Stefan

On Mon, Sep 18, 2006 at 06:10:04AM +0200, Stefan Behnel wrote:
| You can't always infer the features of lxml from the command line tool
| xsltproc (and vice versa) as both serve different purposes.
|
| That said, I tried to pass a stylesheet to xsltproc using a PI and couldn't
| manage to start it without passing a separate stylesheet document, so I
| currently assume the man page to be wrong in that regard. Might be a good idea
| to ask the same question on the libxslt mailing list.
Weird, it works for me here. I've used the example on this page: http://www.xml.com/lpt/a/1102
Saved the XML as 'square.xml', saved the stylesheet as 'squareAsHTML.xsl' in the same folder then ran 'xsltproc square.xml' and it did perform the transformation just fine.
| lxml does not currently support this, first point being that we don't have an
| API for running an XSLT without a stylesheet document. And after a quick
| glance through the libxslt API, I currently do not see a trivial way to
| implement that.
Maybe if I can point out the lines that does it in the xsltproc source? I'm trying to find those right now.
-- Sidnei da Silva

| Maybe if I can point out the lines that does it in the xsltproc
| source? I'm trying to find those right now.
Looks like this is the trick:
    cur = xsltLoadStylesheetPI(style);
    if (cur != NULL) {
        /* it is an embedded stylesheet */
        xsltProcess(style, cur, argv[i]);
        xsltFreeStylesheet(cur);
        cur = NULL;
        goto done;
    }
-- Sidnei da Silva

Sidnei da Silva wrote:
It would of course be good if we didn't do this by default and that this should be enabled explicitly, as it's a potential security concern to pull in stylesheets from URLs. I'm not sure right now what the API would be like at all though in the first place. Regards, Martijn

On Mon, Sep 18, 2006 at 06:34:31PM +0200, Martijn Faassen wrote:
| It would of course be good if we didn't do this by default and that this
| should be enabled explicitly, as it's a potential security concern to
| pull in stylesheets from URLs. I'm not sure right now what the API would
| be like at all though in the first place.
So, supposing I come up with a reasonable API it wouldn't be hard at all to add it? For a start, how's XInclude handled today? Or maybe it isn't?
-- Sidnei da Silva

On Mon, Sep 18, 2006 at 09:30:21PM -0300, Sidnei da Silva wrote:
| On Mon, Sep 18, 2006 at 06:34:31PM +0200, Martijn Faassen wrote:
| | It would of course be good if we didn't do this by default and that this
| | should be enabled explicitly, as it's a potential security concern to
| | pull in stylesheets from URLs. I'm not sure right now what the API would
| | be like at all though in the first place.
FWIW, I think the API should basically be a 'loadStylesheetFromPI' function that takes a tree and returns the same kind of object that etree.XSLT() returns if it can load the stylesheet or None if it can't.
I actually tried to go ahead and create a patch for this but lxml doesn't even compile here. On Win32, with Pyrex 0.9.4.1 (with the patch applied) I get this error:
    src\lxml\etree.c(49704) : error C2137: empty character constant
    error: command '"C:\Arquivos de programas\Microsoft Visual C++ Toolkit 2003\bin\cl.exe"' failed with exit status 2
The line in question has:
    if (*t->s == '^@')
        continue; /* shortcut for erased string entries */
On Ubuntu Edgy, with Pyrex 0.9.3 it doesn't work either. Maybe a non-released version of Pyrex is needed?
-- Sidnei da Silva

Hi Sidnei, Martijn, Sidnei da Silva wrote:
I think the right API would simply be a call to ElementTree.xslt() without stylesheet. Maybe we should then raise an exception if no PI is found.
That's a known bug. Please use the updated Pyrex version from the lxml SVN (see build.txt).
On Ubuntu Edgy, with Pyrex 0.9.3 it doesn't work either.
Sure, it can't handle some of the newer lxml features (and may even have some bugs that prevent it from compiling lxml at all, even older versions). Stefan

On Tue, Sep 19, 2006 at 08:40:10AM +0200, Stefan Behnel wrote:
| I think the right API would simply be a call to ElementTree.xslt() without
| stylesheet. Maybe we should then raise an exception if no PI is found.
Yeah, that would work great. And you also can pass an access_control argument if you wanted to.
I've tried going through a similar route (after sorting out the right version of Pyrex) but maybe I'm on the wrong track. I've added a process_pi=True to XSLT() and then called:
    XSLT(self, process_pi=True)
from ElementTree.xslt() and then in XSLT.__init__() I did:
    if process_pi:
        c_style = xslt.xsltLoadStylesheetPI(c_doc)
    else:
        c_style = xslt.xsltParseStylesheetDoc(c_doc)
However that didn't work for some reason. Maybe c_doc at that point doesn't have the processing instruction anymore.
| > if (*t->s == '^@')
| >     continue; /* shortcut for erased string entries */
|
| That's a known bug. Please use the updated Pyrex version from the lxml SVN
| (see build.txt).
Ok, will do.
-- Sidnei da Silva

Stefan Behnel wrote:
I'd prefer something more explicit myself. I find it a bit dangerous to stop requiring arguments. And then, if you forget the argument, it'll load off stylesheets from some URL somewhere? I'd prefer a separate function to execute the PI-based stylesheet which could indeed have the behavior you suggest. I think it is also valuable to have a way to get to the stylesheet object directly, in case someone wants to cache or modify this stylesheet. Regards, Martijn

On Tue, Sep 19, 2006 at 12:27:55PM +0200, Martijn Faassen wrote:
| I'd prefer something more explicit myself. I find it a bit dangerous to
| stop requiring arguments. And then, if you forget the argument, it'll
| load off stylesheets from some URL somewhere? I'd prefer a separate
| function to execute the PI-based stylesheet which could indeed have the
| behavior you suggest. I think it is also valuable to have a way to get
| to the stylesheet object directly, in case someone wants to cache or
| modify this stylesheet.
That was my original proposal *wink*.
-- Sidnei da Silva

Sidnei da Silva wrote:
I was just listing use cases and various API worries I have - your original proposal was not backed up by such. I find that this helps make clearer what API we really want. I think having an API to pick up the PI automatically would be convenient, but I don't think it should be 'just leave off the parameter of a well-known method' to make it kick in. I also think there are use cases beyond just executing the transformation, listed in my previous mail, which make your original proposal attractive. Regards, Martijn

On Tue, Sep 19, 2006 at 03:42:11PM +0200, Martijn Faassen wrote:
| >That was my original proposal *wink*.
|
| I was just listing use cases and various API worries I have - your
| original proposal was not backed up by such. I find that this helps
| make clearer what API we really want.
I agree with that. I admire you for doing this because it's something I'm not good at.
| I think having an API to pick up the PI automatically would be
| convenient, but I don't think it should be 'just leave off the parameter
| of a well-known method' to make it kick in. I also think there are
| use cases beyond just executing the transformation, listed in my previous
| mail, which make your original proposal attractive.
I'm glad we have someone like you Martijn. I've been away from lxml for a couple months only and I'm exceedingly happy to see how it has evolved in the meantime. And it's not different from any other project that you have been involved in. Personally, I wish people would hear you more than they do. A certain known application server out there would be in a much better position these days.
-- Sidnei da Silva

Sidnei da Silva wrote:
I can't take a lot of credit for that; that's been Stefan Behnel's work mostly, so he deserves the praise. I currently just hang out and give API comments here and there.
Thanks for the compliments, I'm a bit embarrassed now, but it's nice to hear. :) It's hard to say whether Zope would be in a much better position; I don't think my words go unheeded that much anyway so I share in the blame for any problems. I'm on the Zope Foundation board now, after all. Anyway, we're not in a terrible position and we're getting better. Regards, Martijn

Hi Martijn, Martijn Faassen wrote:
Right.
True, that's not very explicit.
Ok, what we are getting at in that case is that we provide a convenience method for the following code:
    tree = etree.parse(<some XML with XSLT PI>)
    root = tree.getroot()
    pi = root.getprevious()
    ... read 'href' value from pi.text if 'type' is XSL ...
    transform = etree.XSLT(etree.parse(href))
    transform(tree)
The PI attribute parsing part is maybe the most tricky one.
Questions:
* is it worth special casing?
* are there other cases where PIs can or should be special cased?
* should we maybe extend the PI class for parsing PI parameters? (-> attrib?) would that put us into a position where special casing is no longer required?
Any comments?
Stefan
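
[Editor's sketch] The "read 'href' value from pi.text" step above could look roughly like the following. This is only an illustration using the standard library, not what lxml does internally: the function names `parse_pseudo_attributes` and `stylesheet_href` are hypothetical, the regular expression handles plain quoted values only (no entity or character references), and "text/xsl" is the only type it accepts because that is the one named in this thread.

```python
import re

# Matches name="value" or name='value' pairs inside the text of a PI.
_PSEUDO_ATTR = re.compile(r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_pseudo_attributes(pi_text):
    """Return a dict of the attribute-like pairs found in a PI's text."""
    attrs = {}
    for name, double_quoted, single_quoted in _PSEUDO_ATTR.findall(pi_text or ""):
        # Exactly one of the two quoted groups is non-empty for a match.
        attrs[name] = double_quoted or single_quoted
    return attrs


def stylesheet_href(pi_text, xsl_type="text/xsl"):
    """Return the href if the PI text declares an XSL stylesheet, else None."""
    attrs = parse_pseudo_attributes(pi_text)
    if attrs.get("type") == xsl_type:
        return attrs.get("href")
    return None
```

The returned href could then be handed to etree.parse() as in the sketch above, or checked for a leading '#' to detect a reference into the same document.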

On Tue, Sep 19, 2006 at 06:14:04PM +0200, Stefan Behnel wrote:
| Ok, what we are getting at in that case is that we provide a convenience
| method for the following code:
I think there's an extra step in there:
| tree = etree.parse(<some XML with XSLT PI>)
| root = tree.getroot()
| pi = root.getprevious()
| ... read 'href' value from pi.text if 'type' is XSL ...
+ ... resolve 'href' through resolvers
| transform = etree.XSLT(etree.parse(href))
| transform(tree)
I believe a CachingResolver then could be used for the caching that Martijn envisioned.
| The PI attribute parsing part is maybe the most tricky one.
|
| Questions:
|
| * is it worth special casing?
Well, since libxml2 provides such an API, there's probably value in it. In fact, we should probably look at the source for xsltLoadStylesheetPI to see if it does anything more than what you're proposing.
| * are there other cases where PIs can or should be special cased?
For stylesheets specifically, I've seen an example where the stylesheet href was a fragment identifier ('#stylesheet') and the stylesheet was then *in the same XML* instead of a separate file.
| * should we maybe extend the PI class for parsing PI parameters? (-> attrib?)
|   would that put us into a position where special casing is no longer
|   required?
That I can't comment on. I have never really used PIs other than for stylesheets. In fact, all the googling that I did seems to indicate it's pretty much the only thing they're used for these days.
-- Sidnei da Silva

On Tue, Sep 19, 2006 at 01:22:41PM -0300, Sidnei da Silva wrote:
| That I can't comment on. I have never really used PI other than for
| stylesheets. In fact, all the googling that I did seems to indicate
| it's pretty much the only thing it's used for these days.
Actually, let me correct that. Docbook makes extensive use of processing instructions.
-- Sidnei da Silva

On Tue, Sep 19, 2006 at 07:55:42PM +0200, Stefan Behnel wrote:
| So? Do they use 'normal' XML attributes?
Apparently yes.
| If yes, then providing a special implementation for PI.attrib might be the
| right thing to do.
+1 then.
-- Sidnei da Silva

Hi Sidnei, Sidnei da Silva wrote:
No, that's already done by parse(), see resolvers.txt.
Bad argument. libxml2 and libxslt have all sorts of redundant APIs.
That could easily be handled via Python resolvers. So, given the fact that the libxml2 API can't actually resolve URLs through Python resolvers, we should consider not using it and instead making the way I sketched above a bit smoother. Stefan

On Tue, Sep 19, 2006 at 07:58:36PM +0200, Stefan Behnel wrote:
| > I think there's an extra step in there:
| >
| > | tree = etree.parse(<some XML with XSLT PI>)
| > | root = tree.getroot()
| > | pi = root.getprevious()
| > | ... read 'href' value from pi.text if 'type' is XSL ...
| > + ... resolve 'href' through resolvers
| > | transform = etree.XSLT(etree.parse(href))
| > | transform(tree)
|
| No, that's already done by parse(), see resolvers.txt.
Oh, sorry. Didn't spot that.
| > In fact, we should probably look at the source for
| > xsltLoadStylesheetPI to see if it does anything more than what you're
| > proposing.
| >
| > | * are there other cases where PIs can or should be special cased?
| >
| > For stylesheets specifically, I've seen an example where the stylesheet
| > href was a fragment identifier ('#stylesheet') and the stylesheet was
| > then *in the same XML* instead of a separate file.
|
| That could easily be handled via Python resolvers.
|
| So, given the fact that the libxml2 API can't actually resolve URLs through
| Python resolvers, we should consider not using it and instead making the way I
| sketched above a bit smoother.
Sure, why not. Martijn, what's your opinion?
-- Sidnei da Silva

Stefan Behnel wrote: [snip]
I'm all for going through Python resolvers. It makes it much easier to be sure we're secure as well, as in that it isn't plucking off URLs from the web when we don't want it to. I realize though that there are libxml2 options to turn this off and on. What's the current situation of our usage of these anyway? Regards, Martijn

Hi Martijn, Martijn Faassen wrote:
Actually, libxslt has special code for this in its xsltLoadStylesheetPI function, so that would be the easiest way to support this. However, the function also modifies the document in-place, so we'd have to deep-copy the document before the call in order to keep the stylesheet portion from being modified. Not the most beautiful thing ever...
So, given the fact that the libxml2 API can't actually resolve URLs through Python resolvers,
I double checked this now and it really seems to be the case that libxslt uses the normal resolver callback when loading the PI stylesheet, but we are left without context in this case, so it's hard to determine which set of Python resolvers should actually be used. Normally, when we load documents for XSLT, we always have an XSLT object around that sets up the required context. The libxslt function for PI resolving only passes a URL, so there's no place to keep a context.
libxslt respects the default security prefs, so we could set up the access rules at a per-thread level. Not very satisfactory, IMHO, and the same problem as with the resolvers above. It would be too bad if we ended up calling a method (or utility function) on an ElementTree object that relied on global settings for security settings and URL resolving. We can do much better with an XSLT object here.
XSLT accepts an XSLTAccessControl object in the constructor that holds the security settings: file access read/write, directory creation and network access. I think that's a nice and simple API. I'll also add the respective keyword argument to ET.xslt(), now that I'm at it. So, I feel pretty much convinced that we should side-step the libxslt function and brew our own replacement. Stefan

Hi, so, back to this proposal. Stefan Behnel wrote:
What should this become? I think the beginning is ok so far:
    tree = etree.parse(<some XML with XSLT PI>)
    root = tree.getroot()
    pi = root.getprevious()
Maybe we should think about better PI support here. Currently, there is no way to add a PI before the root node, for example. So there should be a better way for handling them in general.
Then, if we change PI.attrib/.get/.set to work on the PI text (not sure how to do this right), we could end up with this:
    href = pi.get("href")
    type = pi.get("type")
    if type == "text/xsl":
        ...
That's not too bad, I'd say. And then the rest becomes:
    transform = etree.XSLT(etree.parse(href))
    transform(tree)
which supports Python resolvers and access control as usual. The case of an inner-document reference could be handled as follows:
    href = pi.get("href")
    if href[:1] == '#':
        xslt_root = root.xpath("//xsl:stylesheet[@xml:id = $href]",
                               {"xsl":"..."}, href=href)
        transform = etree.XSLT(xslt_root)
or maybe even with XMLDTDID() and the like.
We could also think about a custom element class for the PI, such as an XSLProcessingInstruction. The infrastructure for this is mostly in place already. That would let the usage look like this:
    xslt_pi = root.getprevious() # "xsl-transform" PI
    xslt_doc = xslt_pi.parseXSL(parser=...)
    transform = etree.XSLT(xslt_doc)
Same support for parsers, resolvers and access control, but a much simpler API that would mainly hide the above code.
Any comments?
Stefan

On Wed, Sep 20, 2006 at 10:27:01PM +0200, Stefan Behnel wrote:
| What should this become? I think the beginning is ok so far:
|
|     tree = etree.parse(<some XML with XSLT PI>)
|     root = tree.getroot()
|     pi = root.getprevious()
One problem here: there might be more than one PI before the root node I believe. So you would have to look at all of them.
| Maybe we should think about better PI support here. Currently, there is no way
| to add a PI before the root node, for example. So there should be a better way
| for handling them in general.
I agree with that.
| Then, if we change PI.attrib/.get/.set to work on the PI text (not sure how to
| do this right), we could end up with this:
|
|     href = pi.get("href")
|     type = pi.get("type")
|     if type == "text/xsl":
|         ...
|
| That's not too bad, I'd say. And then the rest becomes:
|
|     transform = etree.XSLT(etree.parse(href))
|     transform(tree)
|
| which supports Python resolvers and access control as usual. The case of an
| inner-document reference could be handled as follows:
|
|     href = pi.get("href")
|     if href[:1] == '#':
|         xslt_root = root.xpath("//xsl:stylesheet[@xml:id = $href]",
|                                {"xsl":"..."}, href=href)
|         transform = etree.XSLT(xslt_root)
|
| or maybe even with XMLDTDID() and the like.
Indeed. That's easier than I thought.
| We could also think about a custom element class for the PI, such as an
| XSLProcessingInstruction. The infrastructure for this is mostly in place
| already. That would let the usage look like this:
|
|     xslt_pi = root.getprevious() # "xsl-transform" PI
|     xslt_doc = xslt_pi.parseXSL(parser=...)
|     transform = etree.XSLT(xslt_doc)
|
| Same support for parsers, resolvers and access control, but a much simpler API
| that would mainly hide the above code.
That would be perfect. The only issue is possibly multiple PIs before the root node and how to find the ones that are xsl-transform.
-- Sidnei da Silva
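
[Editor's sketch] The concern about several PIs (and comments) preceding the root node can be handled by walking all preceding siblings of the root instead of calling getprevious() once. The following is a minimal sketch assuming a current lxml release is installed; the helper name `find_stylesheet_pis` and the sample document are made up for illustration.

```python
from lxml import etree


def find_stylesheet_pis(root):
    """Return the xml-stylesheet PIs that precede the root, in document order."""
    found = []
    for node in root.itersiblings(preceding=True):
        # Comments and PIs may both appear before the root element; keep
        # only PIs whose target is "xml-stylesheet".
        if node.tag is etree.ProcessingInstruction and node.target == "xml-stylesheet":
            found.append(node)
    found.reverse()  # itersiblings(preceding=True) walks backwards
    return found


doc_root = etree.fromstring(
    b'<?xml-stylesheet type="text/xsl" href="first.xsl"?>'
    b'<!-- a comment -->'
    b'<?other-target some data?>'
    b'<?xml-stylesheet type="text/xsl" href="second.xsl"?>'
    b'<root/>'
)
stylesheet_pis = find_stylesheet_pis(doc_root)
```

Which of several stylesheet PIs should win is a policy decision the thread leaves open; the sketch only collects the candidates.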

Hi Sidnei, Sidnei da Silva wrote:
Sure, but you'd have to do that anyway. Just use a loop.
Ok, but that's independent of any special support for XSLT-PIs. So there are three topics here:
* better support for handling PIs in general
* support for parsing attribute-like character sequences in PIs
* special support for the "xml-stylesheet" PI
hasattr(pi, 'parseXSL') would be a good start, I guess. I committed an implementation of the above to the SVN trunk, please check it out to see if it fits your expectations. Stefan
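
[Editor's sketch] For readers on a present-day lxml, the design discussed here is available as a parseXSL() method on the xml-stylesheet PI. The example below is a minimal sketch of the embedded (xml:id) case, assuming such a release is installed; the sample document and variable names are illustrative, and behaviour in the 2006 SVN trunk mentioned above may have differed.

```python
from lxml import etree

XML = b'''<?xml-stylesheet type="text/xsl" href="#style"?>
<a>
  <b>B</b>
  <xsl:stylesheet version="1.0" xml:id="style"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="*" />
    <xsl:template match="/">
      <foo><xsl:value-of select="/a/b/text()" /></foo>
    </xsl:template>
  </xsl:stylesheet>
</a>'''

root = etree.fromstring(XML)
pi = root.getprevious()            # the xml-stylesheet PI before the root
stylesheet_tree = pi.parseXSL()    # ElementTree for the embedded stylesheet
transform = etree.XSLT(stylesheet_tree.getroot())
result = transform(root.getroottree())
```

Because the stylesheet is obtained as an ordinary tree first, it can be cached or inspected before etree.XSLT() is built, which is the use case Martijn raised earlier in the thread.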

On Fri, Sep 22, 2006 at 06:48:39PM +0200, Stefan Behnel wrote:
| I committed an implementation of the above to the SVN trunk, please check it
| out to see if it fits your expectations.
Looks like you only wrote tests for the embedded id inclusion. Would be nice to add a test for filename/url inclusion just to make sure. Other than that, looks great to me.
-- Sidnei da Silva

Hi Sidnei, Sidnei da Silva wrote:
You can't always infer the features of lxml from the command line tool xsltproc (and vice versa) as both serve different purposes. That said, I tried to pass a stylesheet to xsltproc using a PI and couldn't manage to start it without passing a separate stylesheet document, so I currently assume the man page to be wrong in that regard. Might be a good idea to ask the same question on the libxslt mailing list. lxml does not currently support this, first point being that we don't have an API for running an XSLT without a stylesheet document. And after a quick glance through the libxslt API, I currently do not see a trivial way to implement that. Stefan

On Mon, Sep 18, 2006 at 06:10:04AM +0200, Stefan Behnel wrote: | You can't always infer the features of lxml from the command line tool | xsltproc (and vice versa) as both serve different purposes. | | That said, I tried to pass a stylesheet to xsltproc using a PI and couldn't | manage to start it without passing a separate stylesheet document, so I | currently assume the man page to be wrong in that regard. Might be a good idea | to ask the same question on the libxslt mailing list. Weird, it works for me here. I've used the example on this page: http://www.xml.com/lpt/a/1102 Saved the XML as 'square.xml', saved the stylesheet as 'squareAsHTML.xsl' in the same folder then ran 'xsltproc square.xml' and it did perform the transformation just fine. | lxml does not currently support this, first point being that we don't have an | API for running an XSLT without a stylesheet document. And after a quick | glance through the libxslt API, I currently do not see a trivial way to | implement that. Maybe if I can point out the lines that does it in the xsltproc source? I'm trying to find those right now. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

| Maybe if I can point out the lines that does it in the xsltproc | source? I'm trying to find those right now. Looks like this is the trick: cur = xsltLoadStylesheetPI(style); if (cur != NULL) { /* it is an embedded stylesheet */ xsltProcess(style, cur, argv[i]); xsltFreeStylesheet(cur); cur = NULL; goto done; } -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Sidnei da Silva wrote:
It would of course be good if we didn't do this by default and that this should be enabled explicitly, as it's a potential security concern to pull in stylesheets from URLs. I'm not sure right now what the API would be like at all though in the first place. Regards, Martijn

On Mon, Sep 18, 2006 at 06:34:31PM +0200, Martijn Faassen wrote: | It would of course be good if we didn't do this by default and that this | should be enabled explicitly, as it's a potential security concern to | pull in stylesheets from URLs. I'm not sure right now what the API would | be like at all though in the first place. So, supposing I come up with a reasonable API it wouldn't be hard at all to add it? For a start, how's XInclude handled today? Or maybe it isn't? -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

On Mon, Sep 18, 2006 at 09:30:21PM -0300, Sidnei da Silva wrote: | On Mon, Sep 18, 2006 at 06:34:31PM +0200, Martijn Faassen wrote: | | It would of course be good if we didn't do this by default and that this | | should be enabled explicitly, as it's a potential security concern to | | pull in stylesheets from URLs. I'm not sure right now what the API would | | be like at all though in the first place. FWIW, I think the API should basically be a 'loadStylesheetFromPI' function that takes a tree and returns the same kind of object that etree.XSLT() returns if it can load the stylesheet or None if it can't. I actually tried to go ahead and create a patch for this but lxml doesn't even compile here. On Win32, with Pyrex 0.9.4.1 (with the patch applied) I get this error: src\lxml\etree.c(49704) : error C2137: empty character constant error: command '"C:\Arquivos de programas\Microsoft Visual C++ Toolkit 2003\bin\cl.exe"' failed with exit status 2 The line in question has: if (*t->s == '^@') continue; /* shortcut for erased string entries */ On Ubuntu Edgy, with Pyrex 0.9.3 it doesn't work either. Maybe a non-released version of Pyrex is needed? -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Hi Sidnei, Martijn, Sidnei da Silva wrote:
I think the right API would simple be a call to ElementTree.xslt() without stylesheet. Maybe we should then raise an exception if no PI is found.
That's a known bug. Please use the updated Pyrex version from the lxml SVN (see build.txt).
On Ubuntu Edgy, with Pyrex 0.9.3 it doesn't work either.
Sure, it can't handle some of the newer lxml features (and may even have some bugs that prevent it from compiling lxml at all, even older versions). Stefan

On Tue, Sep 19, 2006 at 08:40:10AM +0200, Stefan Behnel wrote: | I think the right API would simple be a call to ElementTree.xslt() without | stylesheet. Maybe we should then raise an exception if no PI is found. Yeah, that would work great. And you also can pass an access_control argument if you wanted to. I've tried going through a similar route (after sorting out the right version of Pyrex) but maybe I'm on the wrong track. I've added a process_pi=True to XSLT() and then called: XSLT(self, process_pi=True) from ElementTree.xslt() and then in XSLT.__init__() I did: if process_pi: c_style = xslt.xsltLoadStylesheetPI(c_doc) else: c_style = xslt.xsltParseStylesheetDoc(c_doc) However that didn't work for some reason. Maybe c_doc at that point doesn't have the processing instruction anymore for some reason. | > if (*t->s == '^@') | > continue; /* shortcut for erased string entries */ | | That's a known bug. Please use the updated Pyrex version from the lxml SVN | (see build.txt). Ok, will do. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Stefan Behnel wrote:
I'd prefer something more explicit myself. I find it a bit dangerous to stop requiring arguments. And then, if you forget the argument, it'll load off stylesheets from some URL somewhere? I'd prefer a separate function to execute the PI-based stylesheet which could indeed have the behavior you suggest. I think it is also valuable to have a way to get to the styleshet object directly, in case someone wants to cache or modify this stylesheet. Regards, Martijn

On Tue, Sep 19, 2006 at 12:27:55PM +0200, Martijn Faassen wrote: | I'd prefer something more explicit myself. I find it a bit dangerous to | stop requiring arguments. And then, if you forget the argument, it'll | load off stylesheets from some URL somewhere? I'd prefer a separate | function to execute the PI-based stylesheet which could indeed have the | behavior you suggest. I think it is also valuable to have a way to get | to the styleshet object directly, in case someone wants to cache or | modify this stylesheet. That was my original proposal *wink*. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Sidnei da Silva wrote:
I was just listing use cases and various API worries I have - your original proposal was not backed up by such. I find that this helps making more clear what API we really want. I thinking having an API to pick up the PI automatically would be convenient, but I don't think it should be 'just leave off the parameter of a well-known method' to make it kick in. I also think there are usecases beyond just executing the transformation, listed in my previous mail, which make your original proposal attractive. Regards, Martijn

On Tue, Sep 19, 2006 at 03:42:11PM +0200, Martijn Faassen wrote: | >That was my original proposal *wink*. | | I was just listing use cases and various API worries I have - your | original proposal was not backed up by such. I find that this helps | making more clear what API we really want. I agree with that. I admire you for doing this because it's something I'm not good at. | I thinking having an API to pick up the PI automatically would be | convenient, but I don't think it should be 'just leave off the parameter | of a well-known method' to make it kick in. I also think there are | usecases beyond just executing the transformation, listed in my previous | mail, which make your original proposal attractive. I'm glad we have someone like you Martijn. I've been away from lxml for a couple months only and I'm exceedingly happy to see how it has evolved in the meantime. And it's not different from any other project that you have been involved. Personally, I wish people would hear you more than they do. A certain known application server out there would be in a much better position these days. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Sidnei da Silva wrote:
I can't take a lot of credit for that; that's been Stephan Behnel's work mostly, so he deserves the praise. I currently just hang out and give API comments here and there.
Thanks for the compliments, I'm a bit embarassed now, but it's nice to hear. :) It's hard to say whether Zope would be in a much better position; I don't think my words go unheeded that much anyway so I share in the blame for any problems. I'm in the Zope Foundation board now, after all. Anyway, we're not in a terrible position and we're getting better. Regards, Martijn

Hi Martijn, Martijn Faassen wrote:
Right.
True, that's not very explicit.
Ok, what we are getting at in that case is that we provide a convenience method for the following code: tree = etree.parse(<some XML with XSLT PI>) root = tree.getroot() pi = root.getprevious() ... read 'href' value from pi.text if 'type' is XSL ... transform = etree.XSLT(etree.parse(href)) transform(tree) The PI attribute parsing part is maybe the most tricky one. Questions: * is it worth special casing? * are there other cases where PIs can or should be special cased? * should we maybe extend the PI class for parsing PI parameters? (-> attrib?) would that put us into a position where special casing is no longer required? Any comments? Stefan

On Tue, Sep 19, 2006 at 06:14:04PM +0200, Stefan Behnel wrote: | Ok, what we are getting at in that case is that we provide a convenience | method for the following code: I think there's an extra step in there: | tree = etree.parse(<some XML with XSLT PI>) | root = tree.getroot() | pi = root.getprevious() | ... read 'href' value from pi.text if 'type' is XSL ... + ... resolve 'href' through resolvers | transform = etree.XSLT(etree.parse(href)) | transform(tree) I believe a CachingResolver then could be used for the caching that Martijn envisioned. | The PI attribute parsing part is maybe the most tricky one. | | Questions: | | * is it worth special casing? Well, since libxml2 provides such an API, there's probably value in it. In fact, we should probably look at the source for xsltLoadStylesheetPI to see if it does anything more than what you're proposing. | * are there other cases where PIs can or should be special cased? For stylesheets specifically, I've saw an example where the stylesheet href was a fragment identifier ('#stylesheet') and the stylesheet was then *in the same XML* instead of a separate file. | * should we maybe extend the PI class for parsing PI parameters? (-> attrib?) | would that put us into a position where special casing is no longer | required? That I can't comment on. I have never really used PI other than for stylesheets. In fact, all the googling that I did seems to indicate it's pretty much the only thing it's used for these days. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

On Tue, Sep 19, 2006 at 01:22:41PM -0300, Sidnei da Silva wrote:
| That I can't comment on. I have never really used PI other than for
| stylesheets. In fact, all the googling that I did seems to indicate
| it's pretty much the only thing it's used for these days.

Actually, let me correct that. Docbook makes extensive use of processing instructions.

-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

On Tue, Sep 19, 2006 at 07:55:42PM +0200, Stefan Behnel wrote:
| So? Do they use 'normal' XML attributes?

Apparently yes.

| If yes, then providing a special implementation for PI.attrib might be the
| right thing to do.

+1 then.

-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Hi Sidnei, Sidnei da Silva wrote:
No, that's already done by parse(), see resolvers.txt.
Bad argument. libxml2 and libxslt have all sorts of redundant APIs.
That could easily be handled via Python resolvers. So, given the fact that the libxml2 API can't actually resolve URLs through Python resolvers, we should consider not using it and instead making the way I sketched above a bit smoother. Stefan

On Tue, Sep 19, 2006 at 07:58:36PM +0200, Stefan Behnel wrote:
| > I think there's an extra step in there:
| >
| > | tree = etree.parse(<some XML with XSLT PI>)
| > | root = tree.getroot()
| > | pi = root.getprevious()
| > | ... read 'href' value from pi.text if 'type' is XSL ...
| > + ... resolve 'href' through resolvers
| > | transform = etree.XSLT(etree.parse(href))
| > | transform(tree)
|
| No, that's already done by parse(), see resolvers.txt.

Oh, sorry. Didn't spot that.

| > In fact, we should probably look at the source for
| > xsltLoadStylesheetPI to see if it does anything more than what you're
| > proposing.
| >
| > | * are there other cases where PIs can or should be special cased?
| >
| > For stylesheets specifically, I've seen an example where the stylesheet
| > href was a fragment identifier ('#stylesheet') and the stylesheet was
| > then *in the same XML* instead of a separate file.
|
| That could easily be handled via Python resolvers.
|
| So, given the fact that the libxml2 API can't actually resolve URLs through
| Python resolvers, we should consider not using it and instead making the way I
| sketched above a bit smoother.

Sure, why not. Martijn, what's your opinion?

-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Stefan Behnel wrote: [snip]
I'm all for going through Python resolvers. It also makes it much easier to be sure we're secure, in that it isn't pulling URLs off the web when we don't want it to. I realize though that there are libxml2 options to turn this off and on. What's the current situation of our usage of these anyway? Regards, Martijn

Hi Martijn, Martijn Faassen wrote:
Actually, libxslt has special code for this in its xsltLoadStylesheetPI function, so that would be the easiest way to support this. However, the function also modifies the document in-place, so we'd have to deep-copy the document before the call in order to keep the stylesheet portion from being modified. Not the most beautiful thing ever...
So, given the fact that the libxml2 API can't actually resolve URLs through Python resolvers,
I double-checked this now, and it really seems to be the case that libxslt uses the normal resolver callback when loading the PI stylesheet. However, we are left without context in this case, so it's hard to determine which set of Python resolvers should actually be used. Normally, when we load documents for XSLT, we always have an XSLT object around that sets up the required context. The libxslt function for PI resolving only passes a URL, so there's no place to keep a context.
libxslt respects the default security prefs, so we could set up the access rules at a per-thread level. Not very satisfactory, IMHO, and the same problem as with the resolvers above. It would be too bad if we ended up calling a method (or utility function) on an ElementTree object that relied on global settings for security settings and URL resolving. We can do much better with an XSLT object here.
XSLT accepts an XSLTAccessControl object in the constructor that holds the security settings: file access read/write, directory creation and network access. I think that's a nice and simple API. I'll also add the respective keyword argument to ET.xslt(), now that I'm at it. So, I feel pretty much convinced that we should side-step the libxslt function and brew our own replacement. Stefan

Hi, so, back to this proposal. Stefan Behnel wrote:
What should this become? I think the beginning is ok so far:

    tree = etree.parse(<some XML with XSLT PI>)
    root = tree.getroot()
    pi = root.getprevious()

Maybe we should think about better PI support here. Currently, there is no way to add a PI before the root node, for example. So there should be a better way for handling them in general.

Then, if we change PI.attrib/.get/.set to work on the PI text (not sure how to do this right), we could end up with this:

    href = pi.get("href")
    type = pi.get("type")
    if type == "text/xsl":
        ...

That's not too bad, I'd say. And then the rest becomes:

    transform = etree.XSLT(etree.parse(href))
    transform(tree)

which supports Python resolvers and access control as usual. The case of an inner-document reference could be handled as follows:

    href = pi.get("href")
    if href[:1] == '#':
        xslt_root = root.xpath("//xsl:stylesheet[@xml:id = $href]",
                               {"xsl":"..."}, href=href)
        transform = etree.XSLT(xslt_root)

or maybe even with XMLDTDID() and the like.

We could also think about a custom element class for the PI, such as an XSLProcessingInstruction. The infrastructure for this is mostly in place already. That would let the usage look like this:

    xslt_pi = root.getprevious() # "xsl-transform" PI
    xslt_doc = xslt_pi.parseXSL(parser=...)
    transform = etree.XSLT(xslt_doc)

Same support for parsers, resolvers and access control, but a much simpler API that would mainly hide the above code.

Any comments?

Stefan
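One detail of the inner-document case: the href still carries its leading '#', so it presumably has to be stripped before comparing against xml:id. The sketch below illustrates that lookup with the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption made only to keep it self-contained). `find_embedded_stylesheet` is a hypothetical helper name, not an existing API.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
# ElementTree exposes xml:id under the fixed XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_embedded_stylesheet(root, href):
    """Return the xsl:stylesheet element that a '#fragment' href points to.

    Returns None for external references or when no xml:id matches.
    """
    if not href.startswith("#"):
        return None  # external reference; would be parsed from href instead
    wanted = href[1:]  # drop the leading '#' before comparing with xml:id
    for node in root.iter("{%s}stylesheet" % XSL_NS):
        if node.get(XML_ID) == wanted:
            return node
    return None
```

The element found this way would then play the role of `xslt_root` in the snippet above.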

On Wed, Sep 20, 2006 at 10:27:01PM +0200, Stefan Behnel wrote:
| What should this become? I think the beginning is ok so far:
|
| tree = etree.parse(<some XML with XSLT PI>)
| root = tree.getroot()
| pi = root.getprevious()

One problem here: there might be more than one PI before the root node, I believe. So you would have to look at all of them.

| Maybe we should think about better PI support here. Currently, there is no way
| to add a PI before the root node, for example. So there should be a better way
| for handling them in general.

I agree with that.

| Then, if we change PI.attrib/.get/.set to work on the PI text (not sure how to
| do this right), we could end up with this:
|
| href = pi.get("href")
| type = pi.get("type")
| if type == "text/xsl":
|     ...
|
| That's not too bad, I'd say. And then the rest becomes:
|
| transform = etree.XSLT(etree.parse(href))
| transform(tree)
|
| which supports Python resolvers and access control as usual. The case of an
| inner-document reference could be handled as follows:
|
| href = pi.get("href")
| if href[:1] == '#':
|     xslt_root = root.xpath("//xsl:stylesheet[@xml:id = $href]",
|                            {"xsl":"..."}, href=href)
|     transform = etree.XSLT(xslt_root)
|
| or maybe even with XMLDTDID() and the like.

Indeed. That's easier than I thought.

| We could also think about a custom element class for the PI, such as an
| XSLProcessingInstruction. The infrastructure for this is mostly in place
| already. That would let the usage look like this:
|
| xslt_pi = root.getprevious() # "xsl-transform" PI
| xslt_doc = xslt_pi.parseXSL(parser=...)
| transform = etree.XSLT(xslt_doc)
|
| Same support for parsers, resolvers and access control, but a much simpler API
| that would mainly hide the above code.

That would be perfect. The only issue is possibly multiple PIs before the root node and how to find the ones that are xsl-transform.

-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Hi Sidnei, Sidnei da Silva wrote:
Sure, but you'd have to do that anyway. Just use a loop.
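Such a loop over the PIs in front of the root element might look like the sketch below. To stay self-contained it uses the standard library's `xml.dom.minidom` rather than lxml (an assumption for illustration; with lxml one would presumably walk backwards with `getprevious()` instead). `stylesheet_pis` is a hypothetical helper name.

```python
from xml.dom import minidom


def stylesheet_pis(document):
    """Yield the text of every xml-stylesheet PI that precedes the root element."""
    for node in document.childNodes:
        if node is document.documentElement:
            break  # stop at the root; only the prolog is of interest
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            yield node.data
```

Each yielded string still has to be checked for its 'type', since an xml-stylesheet PI may just as well point to CSS.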
Ok, but that's independent of any special support for XSLT-PIs. So there are three topics here:

* better support for handling PIs in general
* support for parsing attribute-like character sequences in PIs
* special support for the "xml-stylesheet" PI
hasattr(pi, 'parseXSL') would be a good start, I guess. I committed an implementation of the above to the SVN trunk, please check it out to see if it fits your expectations. Stefan

On Fri, Sep 22, 2006 at 06:48:39PM +0200, Stefan Behnel wrote:
| I committed an implementation of the above to the SVN trunk, please check it
| out to see if it fits your expectations.

Looks like you only wrote tests for the embedded id inclusion. Would be nice to add a test for filename/url inclusion just to make sure. Other than that, looks great to me.

-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214
participants (3)
- Martijn Faassen
- Sidnei da Silva
- Stefan Behnel