[lxml-dev] Custom resolvers

Hi,

since Paul kept bugging me, I created a new branch (resolver-new) and implemented an API for the custom resolvers stuff. It should be pretty simple to use, just create a parser and register the resolver:

    parser = XMLParser()
    parser.resolvers.add(my_resolver)

"my_resolver" must be of type etree.Resolver and provide a method resolve(system_url, public_id, context) that returns either None (== "can't resolve, ask someone else") or a _ParserInput object. These can be built from files or strings using the Resolver methods 'resolve_string' and 'resolve_filename'.

So, to create a custom resolver, you basically do this:

---------------
import lxml.etree

class MyResolver(lxml.etree.Resolver):
    entity = "This was an entity"

    def resolve(self, url, id, context):
        if url == 'my.dtd':
            # I can handle this
            return self.resolve_string(
                u'<!ENTITY myentity "%s">' % self.entity, context)
        elif url.startswith('http://'):
            # the default resolver can handle this
            return super(MyResolver, self).resolve(url, id, context)
        else:
            # don't know what to do, let someone else try
            return None

my_resolver = MyResolver()
---------------

I'll see how to integrate that in other places of the API, especially XSLT and schemas. Anyway, this works so far. Feel free to comment on it.

Stefan
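To make the "return None == ask someone else" contract concrete, here is a minimal pure-Python sketch of how a registry could walk its resolvers. It is only a model of the protocol described above, not lxml's implementation: ResolverRegistry, StringInput and the two toy resolver classes are hypothetical names invented for illustration, and asking resolvers in registration order is an assumption of the sketch.

```python
class StringInput:
    """Hypothetical stand-in for the parser input object a resolver returns."""

    def __init__(self, text):
        self.text = text


class ResolverRegistry:
    """Hypothetical model of parser.resolvers (ordered here by assumption)."""

    def __init__(self):
        self._resolvers = []

    def add(self, resolver):
        self._resolvers.append(resolver)

    def resolve(self, url, public_id, context):
        # Ask each resolver in turn; None means "can't resolve, ask someone else".
        for resolver in self._resolvers:
            result = resolver.resolve(url, public_id, context)
            if result is not None:
                return result
        # Nobody could handle it; the caller would fall back to its default.
        return None


class DTDResolver:
    """Toy resolver that only knows about 'my.dtd'."""

    def resolve(self, url, public_id, context):
        if url == "my.dtd":
            return StringInput('<!ENTITY myentity "This was an entity">')
        return None


class CatchAllResolver:
    """Toy resolver that answers every request with a placeholder."""

    def resolve(self, url, public_id, context):
        return StringInput("<!-- placeholder for %s -->" % url)


registry = ResolverRegistry()
registry.add(DTDResolver())
registry.add(CatchAllResolver())

print(registry.resolve("my.dtd", None, None).text)
print(registry.resolve("other.dtd", None, None).text)
```

The first lookup is answered by DTDResolver; for the second one it returns None, so the request falls through to CatchAllResolver.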

On 20 Apr 2006 at 19:26, Stefan Behnel wrote:
> parser = XMLParser()
> parser.resolvers.add(my_resolver)
Great, so does this resolver only get called when this one parser is used, or is it global to the process (like it is with libxml2)?
> I'll see how to integrate that in other places of the API, especially XSLT and schemas. Anyway, this works so far. Feel free to comment on
If I create a parser, add my resolver, then load an .xslt file into that parser, I'd expect that subsequent use of the parsed document in a transform would continue to use my resolver, and that my resolver would not be called by other documents or transforms.

Is that what really happens? If so, nirvana!

--
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com
AOL-IM or SKYPE: BKClements

Brad Clements wrote:
It's currently local to a parser. I'm looking for a module-level API also, but I'm not sure yet how to make it look pretty. In any case, the parser-level API is likely the preferred one.

So you'd want the resolvers stored at a per-document level rather than in XSLT or RelaxNG? That would totally simplify the API. I think that's a good idea.

So, just to make that clear:

1) resolvers are only registered with parsers.
2) once a document is parsed, a reference to the parser-local resolvers is kept in the document to be reused in all operations where resolving is involved (XSLT, RelaxNG, XInclude, etc.).

Questions:

* if you parse an XSL document with one set of resolvers and then use it to transform an XML document with another set of resolvers, which ones should be used during the transform? My guess is: the document ones, but that may break lookups at the XSLT level (which libxslt handles in the standard resolvers, even for lookups inside the stylesheet itself!). Keeping these lookups separated by source document can get pretty hard, I assume.
* should the document registries be independent of the parser registries or should they reflect updates in their original parser?
> Is that what really happens? If so, nirvana!

Not yet, but close :)

Stefan

On 20 Apr 2006 at 20:30, Stefan Behnel wrote:
Is the ability to register a resolver by-parser new functionality in libxml2?
I don't know anything about RelaxNG. But with respect to XSLT, see below.

> So, just to make that clear:
> 1) resolvers are only registered with parsers.
yes
yes
Well, hmm... when does the XSL transform process xsl:include and xsl:import? I think those two statements should use the resolver assigned to the base XSLT document. During the transform, calls to document() should use the resolver of the base-uri. So that could be tricky; the document() call is complicated. I suppose you could say that document() always uses the resolver associated with the source XML file and just leave it at that. That'd be easy.
> * should the document registries be independent of the parser registries or should they reflect updates in their original parser?

Sorry, I don't understand what you mean.

--
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com
AOL-IM or SKYPE: BKClements

Brad Clements wrote:
No, lxml registers a global resolver and dispatches internally, possibly falling back to the original default resolver.
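The dispatch described here might look roughly like the following pure-Python model. It assumes a single process-wide hook that can find out which parser is currently running (modelled by a hypothetical context.parser attribute), consults that parser's resolvers, and only then falls back to the original default loader. Parser, Context, global_loader and default_loader are illustrative names, not lxml or libxml2 API.

```python
class Parser:
    """Hypothetical parser holding its own list of resolver callables."""

    def __init__(self, name):
        self.name = name
        self.resolvers = []


class Context:
    """Hypothetical per-parse context that knows which parser is running."""

    def __init__(self, parser):
        self.parser = parser


def default_loader(url, context):
    # Stand-in for the library's original default resolver.
    return "default:" + url


def global_loader(url, context):
    """The single process-wide hook: dispatch to the active parser's resolvers."""
    for resolve in context.parser.resolvers:
        result = resolve(url, context)
        if result is not None:
            return result
    # No parser-local resolver wanted the URL: fall back to the default.
    return default_loader(url, context)


def zip_resolver(url, context):
    # Toy resolver that only handles a made-up "zip:" scheme.
    if url.startswith("zip:"):
        return "from-zip:" + url[len("zip:"):]
    return None


parser_a = Parser("a")
parser_a.resolvers.append(zip_resolver)
parser_b = Parser("b")  # no custom resolvers registered

print(global_loader("zip:style.xsl", Context(parser_a)))  # handled by zip_resolver
print(global_loader("zip:style.xsl", Context(parser_b)))  # falls back to the default
```

Although the hook itself is global, the resolver registered on parser_a is never consulted for a parse running under parser_b.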
Includes and imports are handled at compilation time, which happens in XSLT.__init__(). Libxslt uses a different mechanism than libxml2 here, which (as usual) complicates things. It allows you to specify an "xsltDocLoaderFunction" that is expected to operate in the current XSLT context. Replacing this function would also fix the document('') call as it could access the in-memory stylesheet structure instead of trying to re-load it from a possibly unknown source. However, there doesn't seem to be a way to figure out the default document loader function to provide the necessary fallback. So, I don't know, maybe I'll have to see if libxslt can use the libxml2 resolver capabilities instead...
> During the transform, calls to document() should use the resolver of the base-uri.
That's the main problem I see. I'm not sure we can figure out the document that a resolver request comes from by means of libxml2. Libxslt provides this information to the loader function, but as long as we don't have a fallback, we can't just replace the loader function without re-implementing it completely.
Yeah, but it can't always work. Imagine a stylesheet loaded from a ZIP file applied to an XML file loaded from the web. You'd then need both resolvers registered on the XML document. You could possibly imagine using both (e.g. using the XSLT resolvers as a fallback to the XML resolvers). But that may yield other race conditions.
I just meant: should they be stored by reference or copied? But I assume you'd want independent copies to allow updating the parser-local registry without affecting documents that were parsed earlier. So that's a minor problem here.

Stefan
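The difference between storing the registry by reference and copying it can be shown with a small stdlib-only sketch. Registry, Parser and Document are hypothetical stand-ins used only to illustrate the question; the resolver "objects" are plain strings.

```python
class Registry:
    """Hypothetical resolver registry backed by a set."""

    def __init__(self, resolvers=()):
        self._resolvers = set(resolvers)

    def add(self, resolver):
        self._resolvers.add(resolver)

    def copy(self):
        # Shallow copy: same resolver objects, independent membership.
        return Registry(self._resolvers)

    def __len__(self):
        return len(self._resolvers)


class Document:
    """Hypothetical parsed document that remembers a resolver registry."""

    def __init__(self, resolvers):
        self.resolvers = resolvers


class Parser:
    def __init__(self):
        self.resolvers = Registry()

    def parse_by_reference(self):
        # The document shares the parser's registry and sees later updates.
        return Document(self.resolvers)

    def parse_with_copy(self):
        # The document gets a snapshot taken at parse time.
        return Document(self.resolvers.copy())


parser = Parser()
parser.resolvers.add("resolver-1")

shared = parser.parse_by_reference()
snapshot = parser.parse_with_copy()

parser.resolvers.add("resolver-2")  # registered after both documents were parsed

print(len(shared.resolvers))    # 2: the shared registry grew with the parser
print(len(snapshot.resolvers))  # 1: the snapshot is unaffected
```

With copies, a document keeps exactly the resolvers that were registered when it was parsed, which is the behaviour assumed to be wanted above.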

Ok, things are getting somewhere...

Stefan Behnel wrote:
libxslt cleanly separates XSL compile-time and transformation-time lookups by an argument passed to the loader function. This allows us to use a different set of resolvers for each context. There is an undocumented public reference to the default loader that we can use as fall-back. (See http://www.google.de/search?q=xsltDocDefaultLoader+site%3Axmlsoft.org on why I call it undocumented.)

New problem: the default loader of libxslt reuses document references internally, referenced by their URL. I think we should keep this behaviour, which would mean: run the default loader first, and only if that fails dispatch to the Python resolvers. This would disable custom resolvers for file/network URLs etc. but enable them for custom URIs. Those would then even benefit from the internal document reuse. I'm currently using a special prefix "py:" for URIs that are always passed to the custom loaders first.

I implemented a preliminary version in the resolver-new branch and unified the API towards libxml2 and libxslt document loaders (it's the same as in my first mail). I don't currently have any test cases, so maybe those who have been waiting for this feature can start playing with it?

Note that exception handling currently works in the parsers but not yet in XSLT. So, lxml can happily crash if you raise an exception there. That will change - eventually... :)

Stefan
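The lookup order proposed in this message can be modelled in a few lines of plain Python. This is only a sketch of the ordering, not the code in the resolver-new branch: default_loader, python_resolvers and load_document are hypothetical names, and the set of URL schemes each stand-in understands is made up for the example.

```python
seen_by_python = []  # records which URIs reached the custom resolvers


def default_loader(uri):
    """Stand-in for the default loader: only understands file/network URLs."""
    if uri.startswith(("file:", "http://")):
        return "default-loaded:" + uri
    return None  # models a failed load


def python_resolvers(uri):
    """Stand-in for the chain of registered Python resolvers."""
    seen_by_python.append(uri)
    if uri.startswith(("py:", "db:")):
        return "python-resolved:" + uri
    return None


def load_document(uri):
    # URIs with the special prefix always go to the custom resolvers first;
    # everything else tries the default loader first.
    if uri.startswith("py:"):
        order = (python_resolvers, default_loader)
    else:
        order = (default_loader, python_resolvers)
    for loader in order:
        result = loader(uri)
        if result is not None:
            return result
    return None


print(load_document("http://example.org/a.xml"))  # default loader wins
print(load_document("py:menu"))                   # custom resolvers asked first
print(load_document("db:42"))                     # default fails, then custom
print(seen_by_python)                             # the http:// URL never shows up
```

The last line makes the trade-off visible: with this ordering, a custom resolver never gets to see file or network URLs that the default loader can handle itself.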

On 21 Apr 2006 at 17:19, Stefan Behnel wrote:
This is definitely a non-starter for me. My client's websites serve XML with xslt-pi instructions to web clients. We sniff the client, and if that browser can't support client-side transforms we then perform the transform on the server. In that case, the URL to be resolved is probably already a network URL.

I need to be sure that my resolver gets the first crack at it, because I don't want libxslt making a callback to my web server (possibly by a URL that the local process doesn't have access to), a callback that would definitely occur outside the context in which it should occur.

For example, take web requests from authenticated clients with cookies: the cookie won't be passed by libxml2 back to the web server, so authentication is lost.

I am using WSGI, and I use the paste.recursive.include module to "re-use" the current web request when handling Resolver callbacks from libxml2.

--
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com
AOL-IM or SKYPE: BKClements

Brad Clements wrote:
That's a reasonable use-case. I removed the first-shot for the default resolver (and a couple of bugs and crashes). This leaves it to users to decide about the trade-off between document re-use and the full flexibility of dynamic document loading. Although I didn't test it, document re-use should now require some additional user code like URL caching: if the document for that URL was already generated, the default resolver should know about it...

XSLT document loaders should now be in a preliminary usable state. I'll write up some doctests for the new code (doc/resolvers.txt). That'll also show me if (and where) there are still bugs.

Stefan
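The kind of user-side URL caching mentioned here could look like the following sketch. It is plain Python and deliberately independent of lxml: CachingResolver and generate_menu are hypothetical names, and the "documents" are just strings standing in for whatever a real resolver would return.

```python
class CachingResolver:
    """Hypothetical wrapper that remembers documents it has already generated."""

    def __init__(self, generate):
        self._generate = generate  # callable: url -> document, or None
        self._cache = {}
        self.generated = 0         # how often the generator actually ran

    def resolve(self, url, public_id=None, context=None):
        if url in self._cache:
            return self._cache[url]
        document = self._generate(url)
        if document is None:
            return None            # can't resolve, let someone else try
        self.generated += 1
        self._cache[url] = document
        return document


def generate_menu(url):
    # Toy generator for dynamically built documents under one made-up host.
    if url.startswith("http://example.org/"):
        return "<menu src='%s'/>" % url
    return None


resolver = CachingResolver(generate_menu)
first = resolver.resolve("http://example.org/menu.xml")
second = resolver.resolve("http://example.org/menu.xml")

print(first is second)     # True: the second lookup reused the cached document
print(resolver.generated)  # 1: the generator ran only once
```

Whether to cache at all, and for how long, is then a decision made in user code rather than inside the default loader.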
