[lxml-dev] xpath extension functions

Hi there,
First of all, thanks Kapil and Marc-Antoine for the discussion about xpath extension functions. I hope you don't mind me admitting I'm somewhat overwhelmed. I thought a good way to look at it now would be from the perspective of the API that developers see, and to write down some concrete use cases.
I understand that XPath extension functions are registered against an xmlXPathContext, which in turn is created from a document. Namespaces can be registered against this context, and so can extension functions.
Currently the xpath API is a simple convenience function on Element and ElementTree. That is pretty nice and easy, but once you're interested in more sophisticated use cases, it makes more sense to expose a separate XPath object, similar to the way we already have an XSLT object and a RelaxNG object. I can imagine an interaction like this:
Registration of functions could work as a dictionary too:
xpath.registerFunctions(function_dict) # function name : python func
Or an alternative, simpler API with fewer methods could be this:
xpath = XPath(doc, namespace_dict, extension_func_dict)
To pass along a context element to an XPath evaluation, we could allow this:
results = xpath.evaluate('p', el)
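To make the shape of this concrete, here is a minimal sketch of such an object in plain Python. It is only an illustration under some loud assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for libxml2, so only ElementTree's limited path subset works, the extension functions are merely stored and never called, and the keyword names `namespaces` and `functions` are made up rather than a settled API.

```python
import xml.etree.ElementTree as ET


class XPath:
    """Sketch of the proposed object: a document plus two dictionaries."""

    def __init__(self, doc, namespaces=None, functions=None):
        self.doc = doc  # an ElementTree instance
        self.namespaces = dict(namespaces or {})
        self.functions = dict(functions or {})

    def registerNamespaces(self, namespace_dict):
        # prefix : namespace URI
        self.namespaces.update(namespace_dict)

    def registerFunctions(self, function_dict):
        # function name : python func (only stored in this sketch)
        self.functions.update(function_dict)

    def evaluate(self, path, element=None):
        # Without a context element, evaluate relative to the document root.
        context = self.doc.getroot() if element is None else element
        return context.findall(path, self.namespaces)


doc = ET.ElementTree(ET.fromstring(
    "<body><div><p>one</p><p>two</p></div><p>three</p></body>"))
xpath = XPath(doc, {"x": "http://www.example.com/"}, {"foo": len})
div = doc.getroot().find("div")
all_paragraphs = xpath.evaluate(".//p")    # every p in the document
div_paragraphs = xpath.evaluate("p", div)  # only the p children of div
```

The constructor form and the register methods end up equivalent here; the methods just update the same two dictionaries after the fact.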
We can easily reimplement the existing xpath methods in terms of this functionality. They'd just be convenience functions, which you shouldn't use if you're evaluating a lot of xpath expressions against the same document.
If extension functions live in dictionaries, registering them for more than one xpath context wouldn't be hard to do at all. I think that's good enough for now and we don't need to worry much about global registration. If it turns out we want it, it wouldn't be hard to implement an API to manipulate a global dictionary that gets put in whenever an XPath object is initialized (and then potentially overwritten by the function dictionary argument). But all that is easy enough to do yourself, so I'd say let's skip it for now.
I've skipped over the arguments such an extension function should be able to handle. What kind of information does an extension function typically need? Let's give it the minimum amount of information needed, and not overwhelm it with libxml2 complexity. :)
Exceptions raised in extension functions should definitely pop up to the Python level. I understand that this is complicated to do. Would it be doable to keep the information about the last exception raised on the XPath object?
How this ties into XSLT processing I don't know yet. Perhaps we'll have a similar extension function registration API on the XSLT object and that could be enough...
I'm not sure how all this ties into the various threading issues Kapil brought up. Kapil, would this design work okay for you?
Marc-Antoine, thanks again for all the work and please let me know what you think about this design, and what I'm still missing. :)
Regards,
Martijn
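The global-dictionary fallback and the "keep the last exception on the XPath object" idea from the message above could look roughly like the following. This is a hedged sketch in plain Python, not anything that exists in lxml: `GLOBAL_FUNCTIONS`, the `call` method and the `last_exception` attribute are hypothetical names, and a real implementation would have to catch the exception in the C callback rather than in Python.

```python
GLOBAL_FUNCTIONS = {}  # hypothetical module-level registry


class XPath:
    def __init__(self, functions=None):
        # Start from the global table, then let the argument overwrite it.
        self.functions = dict(GLOBAL_FUNCTIONS)
        self.functions.update(functions or {})
        self.last_exception = None

    def call(self, name, *args):
        """Call an extension function, remembering any exception it raises."""
        self.last_exception = None
        try:
            return self.functions[name](*args)
        except Exception as exc:
            # Keep the exception so the Python caller can inspect or re-raise it.
            self.last_exception = exc
            return None


GLOBAL_FUNCTIONS["shout"] = str.upper
GLOBAL_FUNCTIONS["fail"] = lambda: "global version"
xpath = XPath({"fail": lambda: 1 / 0})  # overrides the global "fail"
shouted = xpath.call("shout", "p")
failed = xpath.call("fail")
```

Copying the global dictionary at construction time means later global registrations do not leak into existing objects, which seems the less surprising choice.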

OK. I finally took some time to read this.
Overall, it is very close to what I had in mind, except that I would not call that class XPath. I understand it wraps XPath functionality, but what you describe is very much what an xmlXPathContext does... i.e. bind namespaces and extension functions! So I suggest we call the class XPathContext.... I read your whole document with s/XPath/XPathContext/g; ;->
xpath.registerNamespaces(namespace_dict)
xpath.registerFunction('foo', f)
I had already written something closer to
results = xpath.evaluate('//p')
OK, here you assume relative to the document. The context also allows the idea of current node, which is a good thing when you work within an extension function. So I would have a syntax
results = xpath.evaluate('//p', node)
which accepts a given node, defaulting to the context's current location.
On that note, if you want to reuse functions and namespaces, I would allow a clone function:
xpathContext2 = xpathContext.cloneWithDoc(doc) or something like that.
An upside would be the following: Assume that we store the (Pyrex) XPathContext object in the void* userData in the (C) xmlXPathContext. Recall that (Python) extension functions receive the (C) xmlXPathParserContext (as a Pyrex object of course), and hence can access the (C) xmlXPathContext, so we could give them access to the (Pyrex) XPathContext. That way, Kapil can store user data in some object variable in the (per-session) clone of the XPathContext, and make all the clones from a single one which is configured with namespaces and extensions.
Also, each document gets to reuse an XPathContext, which means that we do not have to set one up each time we evaluate an xpath.
Finally, if we want to use an XPathContext with a different origin from within an extension function, I see two solutions:
The first is, again, to clone the context:
xpathContext2 = xpathContext.cloneWithNode(node)
(so the target node would be read-only)
The second way could be to save the target node in a local variable while calling the extension function, in case it messes up the target node. (Much less happy about that option.)
Finally, another thing I would add to the xpathContext API is the option to declare variables (again as a dictionary) that can be read (only) by extension functions. Another way for Kapil to do things... though there he only gets a literal or node, not a Python object. Still, a literal can also be a key into a thread-global dictionary of session objects.
Overall, I think this can fly.
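To illustrate the cloning idea, here is a rough pure-Python sketch. It only models the bookkeeping (no libxml2 involved), and everything beyond the names cloneWithDoc, cloneWithNode and userData used above is an assumption: in particular, the clones here copy the dictionaries rather than share them, and documents and nodes are plain placeholder strings.

```python
class XPathContext:
    def __init__(self, doc, namespaces=None, functions=None, node=None):
        self.doc = doc
        self.node = node  # current node; None means the document itself
        self.namespaces = dict(namespaces or {})
        self.functions = dict(functions or {})
        self.userData = None  # per-clone slot, e.g. for a session object

    def cloneWithDoc(self, doc):
        # Same namespaces and functions, bound to another document.
        return XPathContext(doc, self.namespaces, self.functions)

    def cloneWithNode(self, node):
        # Same document and configuration, different current node.
        return XPathContext(self.doc, self.namespaces, self.functions, node)


# One configured template, cloned once per session.
template = XPathContext(None, {"x": "http://www.example.com/"}, {"foo": len})
session = template.cloneWithDoc("document-1")  # placeholder document
session.userData = {"user": "kapil"}
inner = session.cloneWithNode("node-7")        # placeholder node
```

Because each clone gets its own userData, the template itself stays free of per-session state.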
Yes... I do like this, but let us look at the XSLT philosophy, which assumes modules, before we do too much that is incompatible... So I am a tad less sure here.
The way I see it, we should actually be able to associate extension functions and elements with a namespace. A module allows that in a neat fashion: a module encapsulates a namespace URI, extension functions and elements, and management functions that are called at the beginning/end of module setup and document transform respectively. (And yes, I have found both of those useful!) Also, modules are very much registered globally. (They are associated to a local name at the level of an individual transform, however.)
SO.... One option is to actually reuse the existing libxslt.extensionModule class. It is pure python, and allows people (like me) who have already built such things to reuse them, which I think is a good thing ;-)
Here is the syntax for this class (abbreviated):
class extensionModule:
    def styleInit(self, stylesheet, URI):
    def styleShutdown(self, stylesheet, URI, data):
    def ctxtInit(self, transformCtxt, URI):
    def ctxtShutdown(self, transformCtxt, URI, data):
In theory, the style init must register the module's own extension functions and elements. When I used this myself, I used a syntax convention: methods starting with 'xpath' were xpath extension functions and methods starting with 'xsl' were xslt extension elements. Of course, that is not good enough for an API for public consumption. I think that maybe we should add decorators for either case, and people can use them with 2.3 syntax if desired.
I also used to convert Python camel-caps names to xsl hyphenated lower case names by default. That is nice as a default behaviour, but of course we should allow more flexibility if the users want to use their own names.
So here is what we would have:
# This is private. Could be made static methods to extensionModule?
def xpathExtensionFunction(func, name=None, uri=None): # decorator
    if not name:
        name = convertToHyphenated(func.__name__)
    if not uri:
        try:
            uri = func.im_class.getURI()
        except:
            pass
    registerXPathExtensionFunction(func, name, uri,....)
def xsltExtensionElement(func): # decorator
    pass
# this is public
class extensionModule:
    @classmethod
    def getURI(cls): pass
    def styleInit(self, stylesheet, URI): pass
    def styleShutdown(self, stylesheet, URI, data): pass
    def ctxtInit(self, transformCtxt, URI): pass
    def ctxtShutdown(self, transformCtxt, URI, data): pass
# this is user code
class MyModule(extensionModule):
    @classmethod
    def getURI(cls): return "http://www.example.com/"
    @xpathFunction(name='my-own-optional-name')
    def myFunction(self, xslXPathContext, args): pass
    ...
# global registration for a module (as opposed to a function)
etree.registerModule(MyModule)
# for stylesheet: bind to a local name. (Should it do the global registration if needed?)
myModule = MyModule()
myModule.userData = x # if desired; the module-bound extension functions get "self".
xsl.registerModule(myModule, localName)
# (Note: I have to check that passing the module instance to the xpathExtensionFunctions is not also using the userData in the xmlXPathContext, which I intend for the Pyrex XPathContext...)
Now, we should also allow
xpathContext.registerModule(myModule, localName)
(including with a URI of None.)
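The naming convention and the decorator idea above can be tried out without any libxml2 machinery. The following is a self-contained sketch in modern Python, under stated assumptions: `convert_to_hyphenated`, `xpath_function`, `ExtensionModule` and `xpath_functions` are hypothetical stand-ins for the names in the outline, the decorator only tags the method instead of registering anything, and the module's URI is a plain class attribute rather than a getURI classmethod.

```python
import re


def convert_to_hyphenated(name):
    """Turn a camel-caps name such as 'myFunction' into 'my-function'."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"-\1", name).lower()


def xpath_function(name=None):
    """Mark a method as an XPath extension function, with an optional name."""
    def mark(func):
        func.xpath_name = name or convert_to_hyphenated(func.__name__)
        return func
    return mark


class ExtensionModule:
    uri = None  # namespace URI shared by all functions of the module

    def xpath_functions(self):
        """Return {(uri, name): bound method} for every marked method."""
        table = {}
        for attr in dir(self):
            member = getattr(self, attr)
            xpath_name = getattr(member, "xpath_name", None)
            if callable(member) and xpath_name:
                table[(self.uri, xpath_name)] = member
        return table


class MyModule(ExtensionModule):
    uri = "http://www.example.com/"

    @xpath_function()
    def myFunction(self, context, args):
        return len(args)

    @xpath_function(name="my-own-optional-name")
    def somethingElse(self, context, args):
        return args


table = MyModule().xpath_functions()
```

A registry such as registerModule could then walk this table and bind each (URI, name) pair, so the functions are bound methods and get "self" for free.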
Or an alternative, simpler API with fewer methods could be this:
xpath = XPath(doc, namespace_dict, extension_func_dict)
Great. Let us also think of a version with a module. I would argue for named arguments.
Less enthusiastic about this one. As I said, I would like the second argument to be the target node. Maybe again with a named parameter?
I will look at this option, yes. It should work much better than the silly global I sent you at first. Marc-Antoine

Marc-Antoine Parent wrote:
It's indeed probably just an xmlXPathContext wrapper. I don't think we need to bother any developer with talk about a 'Context'; it's a superfluous term carrying over from libxml2 in my mind, just like the API doesn't mention RelaxNG contexts or XSLT contexts. Unless a class 'XPath' brings something else to mind and confuses you, I want to go with XPath. :)
How would the name otherwise be deduced? If we make the name non-optional, they can be dictionary keys...
I thought I described that later on in my document. :)
Yeah, a clone() could be doable, and is a reasonable idea, but I won't worry about it for now. If initializing the XPath object is as easy as putting in a document and two dictionaries (one for namespaces, one for functions), I think people can do that themselves. I'm looking to cut out the API we really need first.
I'm not sure I understand. Why not provide a new object altogether for UserData? I don't see why this need be the task of the XPath context. Could be a third argument to evaluate(), perhaps?
This is a separate idea from the above user data story, right? I see this as an optimization of the .xpath method. It would require a way for the XPath object of the document to be seeded with namespaces, through a separate API on the document or something. An alternative is just to do away with the .xpath method altogether and require people to use XPath() directly, which would make it harder for people to make performance mistakes. I'm not sure; the convenience it offers right now, especially for evaluating in element contexts, is pretty nice.
Finally, if we want to use a XPathContext with a different origin from within an extension function, I see two solutions:
What is an XPathContext with a different origin and why would we want to do that? You mean to have an extension function do its own XPath evaluation? I can see here why cloning would be convenient, as you wouldn't need to figure out which extension functions to set up again; presumably you'd want to use the same set as before.
What is the target node?
Having some way to pass along Python objects to extension functions would be nice. This is separate from the XPath $variable concept, which we'll also need to support.
I think worrying about XSLT when we get to it would be fine. A dictionary of extension functions could be turned into some kind of module, right?
The way I see it, we should actually be able to associate extension functions and elements with a namespace. A
Extension elements are a separate story again, right? Can extension functions have a namespace?
I won't allow anything like the libxml2-style APIs near the lxml API. :) I'm sure there are useful concepts in there, but I first want to tackle XPath. Then we'll look at XSLT. If we keep the XPath API as minimal and simple as we can (for the Python developer using lxml), we should be able to translate some of those concepts into the XSLT API. Anyway, I'll skip the XSLT part for now until we've implemented an XPath API. I think I'll sit down with your patch sometime soon and try to build up the API we've sketched here, at least in basic form. Then hopefully you can give feedback on how to tackle the tricky bits, of which there are many. [snip rest of long mail which I'll look at later; I see there's some XPath stuff mixed in that I need to read too] Regards, Martijn

participants (2): Marc-Antoine Parent, Martijn Faassen