Re: [lxml] Support integration with other tree changing libxml2 based libraries

June 23, 2012

Dieter Maurer, 23.06.2012 12:02:
...
I am working on an integration of `lxml` and `libxmlsec` (the
XML security library)
Cool. I'm sure that a lot of people will be happy about this.
...
and I have hit an important problem:
`libxmlsec` functions can change the libxml2 document (tree)
and thereby seriously confuse `lxml`.
I can imagine. lxml's speed is built upon a couple of assumptions about the
tree, including that it can figure out when a tree must be discarded from
memory based on the Python proxy Elements it finds in it.
...
The major problem is that `libxmlsec` may unlink and release subtrees
leading to a `SIGSEGV` in `lxml` code when it later accesses those subtrees.
Fortunately, `libxmlsec` can be told not to release unlinked
subtrees but leave that to the application.
Hmm - but if they are getting unlinked from the tree, how do you find them?
Does libxmlsec have a callback for this?
...
But now, my application
must do that: release the subtree if and only if `lxml` will not do
that at a later time (because it has a reference to some node in the subtree).
Looking at the public `lxml` API, I have not found
such a function.
The public C-API of lxml has mostly grown out of the needs of
lxml.objectify. It may eventually grow further based on other requirements.
It's definitely not carved in stone.
...
I have come up with the following first version
of an `lxml_safe_release`:
cdef int lxml_safe_release(_Document doc, xmlNode* c_node) except -1:
  # we let `lxml` get rid of the subtree by wrapping *c_node* into a
  #  proxy and then releasing it again.
  if elementFactory(doc, c_node) == NULL: return -1
  return 0
That won't work. What you essentially do here is: you create a proxy
Element for the C node (which may (rarely!) fail and raise an exception).
If it succeeds, you test the result against NULL, which it never is.
Actually, Cython will refuse to compile this because NULL is incompatible
with a Python object.

The reason why the above prevents a segfault is that it makes sure there is
at least one Python object wrapper for the subtree by creating one and
discarding it immediately at function exit. The last one of the proxies in
that subtree will then trigger the deallocation as usual.

What you want instead is the function "attemptDeallocation()" in proxy.pxi.
I think it makes sense to make it part of the public API. It's pretty much
a do-what-I-mean kind of tool.
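
To make the rule concrete, here is a toy model in plain Python. It is not lxml's actual code: the Node class, its has_proxy flag and the function names are invented for illustration, and the real function works on xmlNode structs and has more cases to handle (tail text, for example). The sketch only shows the decision: an unlinked subtree may be freed once no node in it is referenced by a Python proxy.

```python
class Node:
    """Hypothetical stand-in for a libxml2 xmlNode (illustration only)."""

    def __init__(self, tag, has_proxy=False, is_document=False):
        self.tag = tag
        self.has_proxy = has_proxy      # models "a Python proxy Element exists"
        self.is_document = is_document  # models an xmlDoc at the top of a tree
        self.parent = None
        self.children = []
        self.freed = False

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child


def iter_subtree(node):
    """Yield node and all of its descendants."""
    stack = [node]
    while stack:
        current = stack.pop()
        yield current
        stack.extend(current.children)


def deallocation_top(node):
    """Return the subtree root that may be freed, or None if it must stay."""
    top = node
    while top.parent is not None:
        top = top.parent
    if top.is_document:
        return None  # still linked into a document, which owns the nodes
    if any(n.has_proxy for n in iter_subtree(top)):
        return None  # some proxy still needs the subtree
    return top


def attempt_deallocation(node):
    """Free the whole unlinked subtree if nothing references it any more."""
    top = deallocation_top(node)
    if top is None:
        return False
    for n in iter_subtree(top):
        n.freed = True  # stands in for xmlFreeNode()
    return True


# An unlinked subtree with one live proxy must be kept ...
sub = Node("Signature")
leaf = sub.append(Node("KeyInfo", has_proxy=True))
kept = attempt_deallocation(sub)

# ... and can be released once the last proxy is gone.
leaf.has_proxy = False
released = attempt_deallocation(sub)
```

In this model the caller never has to know whether a proxy exists: it just asks, and the subtree is freed only when that is safe.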
...
I hope that this will be sufficient to prevent SIGSEGV.
However, I doubt that it is already enough that references into
unlinked subtrees really work correctly. In similar situations,
`lxml` calls `moveNodeToDocument` in order to get namespace references
inside the unlinked subtree self contained. `moveNodeToDocument` is not
public and far too complicated for me to want to include a copy
in my code.
Yes, it's been tuned quite a bit. And yes, if the subtree stays alive due
to a proxy held by the user, then the unlinked subtree needs to be fixed up.
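
As a rough picture of what "fixed up" means for namespaces, here is a toy sketch in plain Python. It is not moveNodeToDocument and uses no lxml or libxml2 API; the Ns and Node classes and make_self_contained() are hypothetical names, and prefix shadowing is ignored. It assumes the simplified view that every node points at a namespace declaration owned by some element, so that after unlinking, a node may point at a declaration that lives outside its subtree.

```python
class Ns:
    """Hypothetical namespace declaration (prefix bound to a URI)."""

    def __init__(self, prefix, href):
        self.prefix = prefix
        self.href = href


class Node:
    """Hypothetical element node (illustration only)."""

    def __init__(self, tag, ns=None):
        self.tag = tag
        self.ns = ns        # declaration this node uses; may be owned elsewhere
        self.ns_defs = []   # declarations owned by this node
        self.children = []

    def append(self, child):
        self.children.append(child)
        return child


def make_self_contained(top):
    """Re-declare on `top` every namespace used inside but declared outside."""

    def visit(node, in_scope):
        in_scope = in_scope + node.ns_defs
        if node.ns is not None and not any(d is node.ns for d in in_scope):
            # Reuse a copy that was already created on the subtree root.
            local = next(
                (d for d in top.ns_defs
                 if d.prefix == node.ns.prefix and d.href == node.ns.href),
                None,
            )
            if local is None:
                local = Ns(node.ns.prefix, node.ns.href)
                top.ns_defs.append(local)
            node.ns = local
        for child in node.children:
            visit(child, in_scope)

    visit(top, [])


# The declaration lives on the root, outside the subtree that gets unlinked.
ds = Ns("ds", "http://www.w3.org/2000/09/xmldsig#")
root = Node("Envelope")
root.ns_defs.append(ds)
sig = root.append(Node("Signature", ns=ds))
info = sig.append(Node("SignedInfo", ns=ds))

root.children.remove(sig)   # unlink the subtree
make_self_contained(sig)    # no node in it points outside any more
```

After the call, the unlinked subtree carries its own copy of the declaration, so it stays valid even if the original tree is freed.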
...
I propose that future `lxml` versions should include a public
`safe_release` function for such purposes.
Maybe a new "removeNodeFromDocument()" API function could first check for
proxies, and then either deallocate or fix up the tree to be stand-alone.
...
Another, but less serious problem: some `libxmlsec` functions
replace a node inside the tree (e.g. a node is replaced by an
`EncryptedData` node representing the node in an encrypted form).
It would be nice if I could "retarget" an `lxml` proxy referencing
the replaced node to point to the replacing node. This way,
`lxml` objects with references to the proxy would see the new
state rather than the confusing picture resulting from the proxy
now referring to an unlinked node.
If you have the proxy Element in your hands, you can call
_unregisterProxy() and _registerProxy() - but that's only from inside of
lxml. I'm not sure I want something like this at the public API layer. It's
really low, low level functionality.

That being said, I don't think you really want this, see below.
...
Of course, the "retarget"ing is not trivial. It is not sufficient
to give the proxy a new "_c_node"; its class, too, might need to
be adapted. This would be possible as long as the two classes
had the same "C" layout for their objects. Is `lxml` supposed
to support proxy classes with differing "C" layout? (I expect "yes"
as the answer.)
...
From the POV of lxml the proxy is just a reference to an object of type (or
subtype of) _Element. The problem is that the user most likely holds
another reference to it, and there is no way we can exchange the object (or
even its class) that that reference points to. These things are a lot less
trivial at the C level than in Python (and even there they can have
surprising side effects).
...
For the moment, I will tell the user of my `libxmlsec` binding:
forget any `lxml` reference into an encrypted or decrypted document,
including a reference to its root tree and always rebuild
references from the operation's return value.
Basically, what this means is that Elements that the user holds a reference
to won't change during the transformation but may no longer be at their
original place afterwards. Perfectly reasonable if you ask me, because
changing the tree is the whole point of doing that transformation. The same
happens in XInclude, for example. Or even just when you change the tag name
of an Element. None of those cases replaces the implementation of an
Element that the user holds. After all, he or she could still need the
original Element for some reason.
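
A small illustration of that behaviour, as a sketch only. It uses the standard library's xml.etree.ElementTree as a stand-in, since the calls used here exist under the same names in lxml.etree. The "EncryptedData" element is built by hand to mimic what an encryption step might leave behind; no libxmlsec is involved.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc><secret>42</secret><other/></doc>")
held = root.find("secret")      # a reference the user keeps around

# A transformation replaces the element in the tree ...
replacement = ET.Element("EncryptedData")
index = list(root).index(held)
root.remove(held)
root.insert(index, replacement)

# ... and renames another element in place.
other = root.find("other")
other.tag = "renamed"

# The held Element is unchanged, it is just no longer part of the tree.
# The renamed Element is still the very same object.
```

To see the new state of the document, the user has to look it up again from the root (or from whatever the operation returns) instead of relying on old references.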

Stefan

Stefan Behnel