Support integration with other tree changing libxml2 based libraries
I am working on an integration of `lxml` and `libxmlsec` (the XML security library) and I have hit an important problem: `libxmlsec` functions can change the libxml2 document (tree) and thereby seriously confuse `lxml`.
The major problem is that `libxmlsec` may unlink and release subtrees, leading to a `SIGSEGV` in `lxml` code when it later accesses those subtrees. Fortunately, `libxmlsec` can be told not to release unlinked subtrees but to leave that to the application. But now my application must do that: release the subtree if and only if `lxml` will not do so at a later time (because it holds a reference to some node in the subtree). Looking at the public `lxml` API, I have not found such a function.
I have come up with the following first version of an `lxml_safe_release`:

    cdef int lxml_safe_release(_Document doc, xmlNode* c_node) except -1:
        # we let `lxml` get rid of the subtree by wrapping *c_node* into a
        # proxy and then releasing it again.
        if elementFactory(doc, c_node) == NULL:
            return -1
        return 0

I hope that this will be sufficient to prevent SIGSEGV. However, I doubt that this alone is enough for references into unlinked subtrees to work correctly. In similar situations, `lxml` calls `moveNodeToDocument` to make the namespace references inside the unlinked subtree self-contained. `moveNodeToDocument` is not public and far too complicated for me to want to include a copy in my code.
I propose that future `lxml` versions should include a public `safe_release` function for such purposes.
Another, but less serious problem: some `libxmlsec` functions replace a node inside the tree (e.g. a node is replaced by an `EncryptedData` node representing the node in an encrypted form). It would be nice if I could "retarget" an `lxml` proxy referencing the replaced node to point to the replacing node. This way, `lxml` objects with references to the proxy would see the new state rather than the confusing picture resulting from the proxy now referring to an unlinked node.
Of course, the "retarget"ing is not trivial. It is not sufficient to give the proxy a new "_c_node"; its class, too, might need to be adapted. This would be possible as long as the two classes had the same "C" layout for their objects. Is `lxml` supposed to support proxy classes with differing "C" layout? (I expect "yes" as the answer.)
For the moment, I will tell the user of my `libxmlsec` binding: forget any `lxml` reference into an encrypted or decrypted document, including a reference to its root tree, and always rebuild references from the operation's return value.
Dieter Maurer, 23.06.2012 12:02:
I am working on an integration of `lxml` and `libxmlsec` (the XML security library)
Cool. I'm sure that a lot of people will be happy about this.
and I have hit an important problem: `libxmlsec` functions can change the libxml2 document (tree) and thereby seriously confuse `lxml`.
I can imagine. lxml's speed is built upon a couple of assumptions about the tree, including that it can figure out when a tree must be discarded from memory based on the Python proxy Elements it finds in it.
The major problem is that `libxmlsec` may unlink and release subtrees leading to a `SIGSEGV` in `lxml` code when it later accesses those subtrees. Fortunately, `libxmlsec` can be told not to release unlinked subtrees but leave that to the application.
Hmm - but if they are getting unlinked from the tree, how do you find them? Does libxmlsec have a callback for this?
But now, my application must do that: release the subtree if and only if `lxml` will not do that at a later time (because it has a reference to some node in the subtree). Looking at the public `lxml` API, I have not found such a function.
The public C-API of lxml has mostly grown out of the needs of lxml.objectify. It may eventually grow further based on other requirements. It's definitely not carved in stone.
I have come up with the following first version of an `lxml_safe_release`:
    cdef int lxml_safe_release(_Document doc, xmlNode* c_node) except -1:
        # we let `lxml` get rid of the subtree by wrapping *c_node* into a
        # proxy and then releasing it again.
        if elementFactory(doc, c_node) == NULL:
            return -1
        return 0

That won't work. What you essentially do here is: you create a proxy Element for the C node (which may (rarely!) fail and raise an exception). If it succeeds, you test the result against NULL, which it never is. Actually, Cython will refuse to compile this because NULL is incompatible with a Python object.
The reason why the above prevents a segfault is that it makes sure there is at least one Python object wrapper for the subtree by creating one and discarding it immediately at function exit. The last one of the proxies in that subtree will then trigger the deallocation as usual.
What you want instead is the function "attemptDeallocation()" in proxy.pxi. I think it makes sense to make it part of the public API. It's pretty much a do-what-I-mean kind of tool.
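The rule described here (free an unlinked subtree only when no Python proxy still refers to any node in it) can be illustrated with a small pure-Python model. This is only a sketch of the idea, not lxml's implementation: `Node`, `unlink`, `subtree` and `attempt_deallocation` are hypothetical names, and the `has_proxy` and `freed` flags stand in for the real back-reference to a proxy and for actually freeing the memory.

```python
class Node:
    """Toy stand-in for a libxml2 node (hypothetical, not the real xmlNode)."""

    def __init__(self, tag, parent=None, is_document=False):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.is_document = is_document
        self.has_proxy = False   # True while a Python proxy wraps this node
        self.freed = False       # set instead of really freeing memory
        if parent is not None:
            parent.children.append(self)


def unlink(node):
    """Detach node (and its subtree) from its parent."""
    if node.parent is not None:
        node.parent.children.remove(node)
        node.parent = None


def subtree(node):
    """Yield node and all of its descendants."""
    yield node
    for child in node.children:
        yield from subtree(child)


def attempt_deallocation(node):
    """Free the subtree containing node if nothing needs it any more.

    Returns True if the subtree was freed, False otherwise.
    """
    top = node
    while top.parent is not None:
        top = top.parent
    if top.is_document:
        return False             # still linked into the document
    if any(n.has_proxy for n in subtree(top)):
        return False             # a live proxy will trigger cleanup later
    for n in subtree(top):
        n.freed = True
    return True


# Usage: a subtree survives while a proxy exists and is freed afterwards.
doc = Node("#document", is_document=True)
root = Node("Envelope", doc)
body = Node("Body", root)
leaf = Node("Secret", body)

unlink(body)
leaf.has_proxy = True
kept = attempt_deallocation(body)       # proxy on a descendant blocks the free
leaf.has_proxy = False
released = attempt_deallocation(leaf)   # any node of the subtree may be passed
```

In this model the caller does not need to know whether a proxy exists; it simply asks, and the function either frees the subtree or leaves it for the last proxy to clean up.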
I hope that this will be sufficient to prevent SIGSEGV. However, I doubt that this alone is enough for references into unlinked subtrees to work correctly. In similar situations, `lxml` calls `moveNodeToDocument` to make the namespace references inside the unlinked subtree self-contained. `moveNodeToDocument` is not public and far too complicated for me to want to include a copy in my code.
Yes, it's been tuned quite a bit. And yes, if the subtree stays alive due to a proxy held by the user, then the unlinked subtree needs to be fixed up.
I propose that future `lxml` versions should include a public `safe_release` function for such purposes.
Maybe a new "removeNodeFromDocument()" API function could first check for proxies, and then either deallocate or fix up the tree to be stand-alone.
Another, but less serious problem: some `libxmlsec` functions replace a node inside the tree (e.g. a node is replaced by an `EncryptedData` node representing the node in an encrypted form). It would be nice if I could "retarget" an `lxml` proxy referencing the replaced node to point to the replacing node. This way, `lxml` objects with references to the proxy would see the new state rather than the confusing picture resulting from the proxy now referring to an unlinked node.
If you have the proxy Element in your hands, you can call _unregisterProxy() and _registerProxy() - but that's only from inside of lxml. I'm not sure I want something like this at the public API layer. It's really low, low level functionality. That being said, I don't think you really want this, see below.
Of course, the "retarget"ing is not trivial. It is not sufficient to give the proxy a new "_c_node"; its class, too, might need to be adapted. This would be possible as long as the two classes had the same "C" layout for their objects. Is `lxml` supposed to support proxy classes with differing "C" layout? (I expect "yes" as the answer.)
From the POV of lxml the proxy is just a reference to an object of type (or subtype of) _Element. The problem is that the user most likely holds another reference to it, and there is no way we can exchange the object (or even its class) that that reference points to. These things are a lot less trivial at the C level than in Python (and even there they can have surprising side effects).
For the moment, I will tell the user of my `libxmlsec` binding: forget any `lxml` reference into an encrypted or decrypted document, including a reference to its root tree and always rebuild references from the operation's return value.
Basically, what this means is that Elements that the user holds a reference to won't change during the transformation but may no longer be at their original place afterwards. Perfectly reasonable if you ask me, because changing the tree is the whole point of doing that transformation. The same happens in XInclude, for example. Or even just when you change the tag name of an Element. None of those cases replaces the implementation of an Element that the user holds. After all, he or she could still need the original Element for some reason.
Stefan
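The behaviour described above, where an Element held by the user stays intact but is no longer at its original place, can be seen with the ordinary ElementTree-style API. The sketch below is not a `libxmlsec` call: the manual remove/insert merely stands in for a transformation that swaps a node for an `EncryptedData` node, and the import falls back to the standard library's `xml.etree.ElementTree` when `lxml` is not installed.

```python
try:
    from lxml import etree                 # the library under discussion
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same basic API

root = etree.fromstring("<Envelope><Secret>42</Secret></Envelope>")
held = root.find("Secret")                 # reference kept by the user

# Stand-in for an in-place transformation such as encryption:
# the original child is swapped for a new node at the same position.
replacement = etree.Element("EncryptedData")
position = list(root).index(held)
root.remove(held)
root.insert(position, replacement)

# The tree now shows the new node, while the held Element is unchanged
# but detached from the tree.
tags_in_tree = [child.tag for child in root]
held_still_usable = (held.tag, held.text)
```

Code that wants the current state of the tree has to look it up again from the root (or from the operation's return value) rather than rely on `held`.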
Stefan Behnel <stefan_ml@behnel.de> writes:
Dieter Maurer, 23.06.2012 12:02: ...
I propose that future `lxml` versions should include a public `safe_release` function for such purposes.
Maybe a new "removeNodeFromDocument()" API function could first check for proxies, and then either deallocate or fix up the tree to be stand-alone.
That would be ideal.
...
Another, but less serious problem: some `libxmlsec` functions replace a node inside the tree (e.g. a node is replaced by an `EncryptedData` node representing the node in an encrypted form). It would be nice if I could "retarget" an `lxml` proxy referencing the replaced node to point to the replacing node. This way, `lxml` objects with references to the proxy would see the new state rather than the confusing picture resulting from the proxy now referring to an unlinked node. ... Of course, the "retarget"ing is not trivial. It is not sufficient to give the proxy a new "_c_node"; its class, too, might need to be adapted. This would be possible as long as the two classes had the same "C" layout for their objects. Is `lxml` supposed to support proxy classes with differing "C" layout? (I expect "yes" as the answer.)
From the POV of lxml the proxy is just a reference to an object of type (or subtype of) _Element. The problem is that the user most likely holds another reference to it
This means that one cannot replace the proxy object by a new one, but one could change the proxy object's content (e.g. set a new "_c_node", set a new "__class__"). As I understand it, "lxml" ensures that there is at most one proxy for any given "c_node" (by putting a proxy reference into the "_private" of the "c_node"). Thereby, changing the proxy content changes all "views" of the "lxml" application on the respective "c_node".
and there is no way we can exchange the object (or even its class) that that reference points to. These things are a lot less trivial at the C level than in Python (and even there they can have surprising side effects).
I am not sure that I understand your argument (though I fully appreciate your reluctance to provide a public API). In my case, I am not inside a complicated `lxml` context where `lxml` code could hold direct references to internal attributes of the proxy I want to retarget. The only such references are in my binding function -- and of course, I must ensure that they do not get confused.
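To make the retargeting idea concrete, here is a toy Python model of the bookkeeping discussed above: each C node carries at most one back-reference to its proxy in `_private`, and retargeting means unregistering the proxy from the old node and registering it on the new one. All names (`CNode`, `Proxy`, `get_proxy`, `retarget`) are hypothetical, a weak reference stands in for the raw pointer a C library would store, and the sketch deliberately ignores the harder question of changing the proxy's class.

```python
import weakref


class CNode:
    """Toy stand-in for a C-level node (hypothetical)."""

    def __init__(self, tag):
        self.tag = tag
        self._private = None      # weak back-reference to the single proxy


class Proxy:
    """Toy stand-in for a Python-level element proxy (hypothetical)."""

    def __init__(self, c_node):
        self._c_node = c_node

    @property
    def tag(self):
        return self._c_node.tag


def get_proxy(c_node):
    """Return the existing proxy for c_node, creating it only if needed."""
    ref = c_node._private
    proxy = ref() if ref is not None else None
    if proxy is None:
        proxy = Proxy(c_node)
        c_node._private = weakref.ref(proxy)
    return proxy


def retarget(proxy, new_node):
    """Point an existing proxy at new_node, keeping one proxy per node."""
    if new_node._private is not None and new_node._private() is not None:
        raise ValueError("new_node already has a live proxy")
    proxy._c_node._private = None         # unregister from the old node
    proxy._c_node = new_node
    new_node._private = weakref.ref(proxy)  # register on the new node


# Usage: every holder of the proxy sees the new node after retargeting.
old = CNode("Secret")
new = CNode("EncryptedData")
user_ref = get_proxy(old)
retarget(user_ref, new)
tag_seen_by_user = user_ref.tag
```

The model also shows the limit of the approach: `user_ref` now reports the new tag, so a user who still needed the original element has lost access to it.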
For the moment, I will tell the user of my `libxmlsec` binding: forget any `lxml` reference into an encrypted or decrypted document, including a reference to its root tree and always rebuild references from the operation's return value.
Basically, what this means is that Elements that the user holds a reference to won't change during the transformation but may no longer be at their original place afterwards.
The worst behaviour I have observed:

    doc = parse(StringIO("<?...><!-- ... --><Envelope>...</Envelope>"))
    encrypt(..., doc.getroot())
    print tostring(doc)
    # output: <Envelope>...</Envelope>

That means that encrypting the root node of an "_ElementTree" has stripped this tree of its processing instruction and its comment. I understand why this happens but from a user perspective, it can be really surprising.
Perfectly reasonable if you ask me, because changing the tree is the whole point of doing that transformation. The same happens in XInclude, for example. Or even just when you change the tag name of an Element. None of those cases replaces the implementation of an Element that the user holds. After all, he or she could still need the original Element for some reason.
As the example above shows, the user sees neither the original nor the new element.
Stefan Behnel <stefan_ml@behnel.de> writes:
Dieter Maurer, 23.06.2012 12:02: ...
The major problem is that `libxmlsec` may unlink and release subtrees leading to a `SIGSEGV` in `lxml` code when it later accesses those subtrees. Fortunately, `libxmlsec` can be told not to release unlinked subtrees but leave that to the application.
Hmm - but if they are getting unlinked from the tree, how do you find them? Does libxmlsec have a callback for this?
I did not answer this question in my previous response because I thought it was not relevant, but maybe I was wrong. `libxmlsec` does not provide a callback, but it provides an option that, instead of releasing the unlinked subtrees internally, makes them available on the (context) object controlling the operation. The consequence: these subtrees are already unlinked; they still have their `doc` reference but they have already lost their `parent` reference (and likely other references related to their former position in the tree). And one more (most likely irrelevant) detail: `libxmlsec` provides the unlinked subtrees in the form of an `xmlNodeList *`, i.e. the `next` pointer of the individual subtree roots may still point somewhere (as part of the node list).
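Because the subtree roots are chained through their `next` pointers, any code that releases them one by one has to read the link before handing a node over and should clear it so the node no longer points into the list. A minimal Python model of that traversal follows; `DetachedRoot`, `chain` and `release_all` are hypothetical names, and `release` stands for whatever routine finally decides about each subtree.

```python
class DetachedRoot:
    """Toy stand-in for the root of an unlinked subtree (hypothetical)."""

    def __init__(self, tag):
        self.tag = tag
        self.next = None          # link reused to chain the returned list


def chain(*roots):
    """Link the given roots into a singly linked list and return its head."""
    for current, following in zip(roots, roots[1:]):
        current.next = following
    return roots[0] if roots else None


def release_all(head, release):
    """Hand every root in the chain to release(), isolating each one first.

    Returns the number of roots processed.
    """
    count = 0
    node = head
    while node is not None:
        following = node.next     # read the link before the node may vanish
        node.next = None          # cut the stale pointer into the list
        release(node)
        count += 1
        node = following
    return count


# Usage: collect the tags in the order in which the roots are released.
seen = []
head = chain(DetachedRoot("a"), DetachedRoot("b"), DetachedRoot("c"))
processed = release_all(head, lambda node: seen.append(node.tag))
```

Saving `following` first matters because, in the real C setting, `release` may free the node, after which its `next` field can no longer be read safely.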
Participants (2): Dieter Maurer, Stefan Behnel