
Dieter Maurer, 23.06.2012 12:02:
I am working on an integration of `lxml` and `libxmlsec` (the XML security library)
Cool. I'm sure that a lot of people will be happy about this.
and I have hit an important problem: `libxmlsec` functions can change the libxml2 document (tree) and thereby seriously confuse `lxml`.
I can imagine. lxml's speed is built upon a couple of assumptions about the tree, including that it can figure out when a tree must be discarded from memory based on the Python proxy Elements it finds in it.
The major problem is that `libxmlsec` may unlink and release subtrees leading to a `SIGSEGV` in `lxml` code when it later accesses those subtrees. Fortunately, `libxmlsec` can be told not to release unlinked subtrees but leave that to the application.
Hmm - but if they are getting unlinked from the tree, how do you find them? Does libxmlsec have a callback for this?
But now, my application must do that: release the subtree if and only if `lxml` will not do that at a later time (because it has a reference to some node in the subtree). Looking at the public `lxml` API, I have not found such a function.
The public C-API of lxml has mostly grown out of the needs of lxml.objectify. It may eventually grow further based on other requirements. It's definitely not carved in stone.
I have come up with the following first version of an `lxml_safe_release`:
    cdef int lxml_safe_release(_Document doc, xmlNode* c_node) except -1:
        # we let `lxml` get rid of the subtree by wrapping *c_node* into a
        # proxy and then releasing it again.
        if elementFactory(doc, c_node) == NULL:
            return -1
        return 0
That won't work. What you essentially do here is: you create a proxy Element for the C node (which may (rarely!) fail and raise an exception). If it succeeds, you test the result against NULL, which it never is. Actually, Cython will refuse to compile this because NULL is incompatible with a Python object.
The reason why the above prevents a segfault is that it makes sure there is at least one Python object wrapper for the subtree by creating one and discarding it immediately at function exit. The last one of the proxies in that subtree will then trigger the deallocation as usual.
What you want instead is the function "attemptDeallocation()" in proxy.pxi. I think it makes sense to make it part of the public API. It's pretty much a do-what-I-mean kind of tool.
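To make the deallocation rule described above concrete, here is a simplified pure-Python model of it. It is only a sketch: `Node`, `has_proxy`, `subtree_has_proxy`, `free_subtree` and `attempt_deallocation` are hypothetical names invented for illustration, and the real "attemptDeallocation()" in proxy.pxi works on libxml2 nodes and may differ in detail. The model only captures the idea that an unlinked tree is freed when, and only when, no proxy refers to any node in it.

```python
class Node:
    """Hypothetical stand-in for a libxml2 node; not lxml's real data structure."""

    def __init__(self, tag, has_proxy=False):
        self.tag = tag
        self.parent = None
        self.children = []
        self.has_proxy = has_proxy  # True while a Python proxy Element refers to this node
        self.freed = False

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child


def subtree_has_proxy(node):
    """Return True if the node or any of its descendants is referenced by a proxy."""
    return node.has_proxy or any(subtree_has_proxy(c) for c in node.children)


def free_subtree(node):
    """Mark the node and all of its descendants as released."""
    node.freed = True
    for child in node.children:
        free_subtree(child)


def attempt_deallocation(node):
    """Free the whole unlinked tree containing `node` only if nothing references it."""
    top = node
    while top.parent is not None:
        top = top.parent
    if subtree_has_proxy(top):
        return False  # a proxy is still alive; nothing is freed for now
    free_subtree(top)
    return True


top = Node("a")
child = top.append(Node("b", has_proxy=True))
first_try = attempt_deallocation(top)    # a proxy exists, so nothing is freed
child.has_proxy = False                  # the last proxy goes away
second_try = attempt_deallocation(child)  # now the whole tree is released
```

In this model, a caller that has been handed an unlinked subtree can always call `attempt_deallocation()` on it: the call is a no-op while proxies exist and releases the tree otherwise, which is the do-what-I-mean behaviour mentioned above.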
I hope that this will be sufficient to prevent SIGSEGV. However, I doubt that it is enough to make references into unlinked subtrees work correctly. In similar situations, `lxml` calls `moveNodeToDocument` in order to make the namespace references inside the unlinked subtree self-contained. `moveNodeToDocument` is not public and far too complicated for me to want to include a copy in my code.
Yes, it's been tuned quite a bit. And yes, if the subtree stays alive due to a proxy held by the user, then the unlinked subtree needs to be fixed up.
I propose that future `lxml` versions should include a public `safe_release` function for such purposes.
Maybe a new "removeNodeFromDocument()" API function could first check for proxies, and then either deallocate or fix up the tree to be stand-alone.
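A rough pure-Python model of what such a function might do is sketched below. Everything in it is an assumption made for illustration: `Node`, `nsdecls`, `in_scope_namespaces` and `remove_node_from_document` are hypothetical names, the namespace URI is a placeholder, and the namespace fix-up is far simpler than what `moveNodeToDocument` has to handle on real libxml2 trees.

```python
class Node:
    """Hypothetical stand-in for a libxml2 element; not lxml's real structure."""

    def __init__(self, tag, prefix=None, nsdecls=None, has_proxy=False):
        self.tag = tag
        self.prefix = prefix                # namespace prefix this element uses, if any
        self.nsdecls = dict(nsdecls or {})  # prefix -> URI declared on this element
        self.has_proxy = has_proxy          # True while a Python proxy refers to it
        self.parent = None
        self.children = []
        self.freed = False

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child


def iter_subtree(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from iter_subtree(child)


def in_scope_namespaces(node):
    """Collect the prefix -> URI declarations visible at `node` (nearest wins)."""
    scope = {}
    while node is not None:
        for prefix, uri in node.nsdecls.items():
            scope.setdefault(prefix, uri)
        node = node.parent
    return scope


def remove_node_from_document(node):
    """Unlink `node`; free it if unreferenced, otherwise make it stand-alone.

    Returns "freed" or "kept".
    """
    scope = in_scope_namespaces(node)  # must be captured before unlinking
    if node.parent is not None:
        node.parent.children.remove(node)
        node.parent = None

    if not any(n.has_proxy for n in iter_subtree(node)):
        for n in iter_subtree(node):
            n.freed = True
        return "freed"

    # A proxy is still alive: copy the declarations used in the subtree onto its
    # new root. This may also copy a declaration that is shadowed further down,
    # which is harmless in this model.
    for n in iter_subtree(node):
        if n.prefix is not None and n.prefix in scope:
            node.nsdecls.setdefault(n.prefix, scope[n.prefix])
    return "kept"


root = Node("root", nsdecls={"ns": "urn:example:ns"})
sig = root.append(Node("Signature", prefix="ns"))
sig.append(Node("SignedInfo", prefix="ns", has_proxy=True))
other = root.append(Node("other"))

kept = remove_node_from_document(sig)     # referenced: unlinked and fixed up
freed = remove_node_from_document(other)  # unreferenced: unlinked and released
```

The point of folding both cases into one call is that the caller never has to know whether a proxy exists somewhere in the subtree.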
Another, but less serious problem: some `libxmlsec` functions replace a node inside the tree (e.g. a node is replaced by an `EncryptedData` node representing the node in an encrypted form). It would be nice if I could "retarget" an `lxml` proxy referencing the replaced node to point to the replacing node. This way, `lxml` objects with references to the proxy would see the new state rather than the confusing picture resulting from the proxy now referring to an unlinked node.
If you have the proxy Element in your hands, you can call _unregisterProxy() and _registerProxy() - but that's only from inside of lxml. I'm not sure I want something like this at the public API layer. It's really low, low level functionality. That being said, I don't think you really want this, see below.
Of course, the "retarget"ing is not trivial. It is not sufficient to give the proxy a new "_c_node"; its class, too, might need to be adapted. This would be possible as long as the two classes had the same "C" layout for their objects. Is `lxml` supposed to support proxy classes with differing "C" layout? (I expect "yes" as answer.)
From the POV of lxml the proxy is just a reference to an object of type (or subtype of) _Element. The problem is that the user most likely holds another reference to it, and there is no way we can exchange the object (or even its class) that that reference points to. These things are a lot less trivial at the C level than in Python (and even there they can have surprising side effects).
For the moment, I will tell the user of my `libxmlsec` binding: forget any `lxml` reference into an encrypted or decrypted document, including a reference to its root tree, and always rebuild references from the operation's return value.
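The effect behind this advice can be shown with a small sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in, on the assumption that `lxml.etree` behaves the same way for these calls, and the node swap is only a hypothetical placeholder for what an encryption step would do; nothing is actually encrypted.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc><secret>42</secret></doc>")
held = root.find("secret")  # a reference the user keeps across the transformation

# Stand-in for the transformation: swap the node for an EncryptedData placeholder.
replacement = ET.Element("EncryptedData")
index = list(root).index(held)
root.remove(held)
root.insert(index, replacement)

# The held Element itself is unchanged, but it is no longer part of the tree,
# so references have to be looked up again from the transformed document.
tags_after = [child.tag for child in root]
rebuilt = root.find("EncryptedData")
```

After the swap, `held` still has its original tag and text, while `root` only contains the new `EncryptedData` element.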
Basically, what this means is that Elements that the user holds a reference to won't change during the transformation but may no longer be at their original place afterwards. Perfectly reasonable if you ask me, because changing the tree is the whole point of doing that transformation. The same happens in XInclude, for example. Or even just when you change the tag name of an Element. None of those cases replaces the implementation of an Element that the user holds. After all, he or she could still need the original Element for some reason.
Stefan