[lxml-dev] memory management strategies

Hi everybody, I'm trying to figure out strategies to manage memory automatically in lxml in the context of libxml2 trees. Some of my incomplete thinking and description of the problem has been encapsulated in the following document: http://codespeak.net/svn/lxml/trunk/doc/memorymanagement.txt It's quite possible that I'm making things more complicated than necessary, as Vic said. I'm quite curious to hear Vic's and other people's thinking about this. The basic concept is that libxml2 maintains a tree representing XML. This tree can be manipulated by a host of C functions. lxml wraps this tree in two ways now. The wrapping is done by proxies written in Pyrex that typically stand in for a node in the libxml2 tree and expose informationon it. One wrapper, the one I started first, aims to provide an ElementTree API to the libxml2 tree. ElementTree is a pythonic and lightweight way to manipulate XML written by Fredrik Lundh, and lxml just tries to mimic that API as much as possible. The wrapping is currently incomplete, mostly due to the thorny memory management issues. The other wrapper which I started later provides a classical W3C DOM API. It's pretty good in the real-only department already, but as soon as I started considering writeable DOM the memory management issues came up again in full force. Help, please! Regards, Martijn

Hi all, I think the memory management issue is actually pretty easy to solve. Here's the case for an xmlNode: Wrap the c-xmlNode with a Python xmlNode. Make sure that references to a c-xmlNode always have 1 and only 1 xmlNode Python wrapper. Manage this stuff using a flyweight/factory thing where we pass in a c-xmlNode and get back a Python xmlNode wrapper. The factory should keep track of xmlNode's in a weak reference dictionary so that we always have some reference to an xmlNode if anyone else has a reference to it. The xmlNode implements a __dealloc__ method in Pyrex to free the xmlNode when the garbage collector kicks in. If the xmlNode does not belong to any xmlDoc, then we free the c-xmlNode. If the xmlNode belongs to an xmlDoc, then we need to see if any children of the c-xmlDoc have an associated Python wrapper. If there is at least 1 Python wrapper, then we can't free the document. If there are no Python xmlNode wrappers for any c-xmlNode's - then we free each of the c-xmlNodes and then we free the c-xmlDoc. Shouldn't that cover it? I've got a substantial part of this working in my svn repository - I'm just going to tidy it up and I'll release it in the next day or two. Unfortunately - I don't have vic --- Don't be humble ... you're not that great. -- Golda Meir

vng1@mac.com wrote:
The 1 and only 1 requirement does make things significantly simpler, I think. Basically all the factories first look whether the node to be wrapped is already known somewhere else. What kind of datastructure can one use to efficiently check whether we already know a C-level node? I'd prefer to use a hashtable, though a simple implementation could just walk through all nodes in a list/table or something like that.
So you mean a weakref dictionary to node, proxies, not the underlying C nodes, right?
What does 'belonging to an xmlDoc' mean in this case? The situation of a detached node that nonetheless has a pointer to the xmlDoc exists. It's just not attached to the tree, and could safely be removed. There's a problem here if the xmlNode actually has a descendant that does have a proxy pointing to it. You can't deallocate the fragment until *nothing* points to it anymore from Python.
You can generalize this to each fragment; you can have each node in a fragment pointing to the fragment top, which might be the document. Of course there are issues if you move (part of) a fragment to another fragment; all proxies would need to be updated. How do you efficiently check whether there are no more proxies pointing to C-level nodes? A complete freewalk each time a local proxy goes out of scope would be bad.
Shouldn't that cover it?
If you can answer my questions, yes. :) If we had a fast way to check whether any given subtree has any proxies pointing to this then this algorithm sounds feasible.
Would you like access to the lxml svn? If so, please drop me a mail and I'll get you in touch with Philipp von Weitershausen who will be able to give you access. Regards, Martijn

On 1-Oct-04, at 02:32 PM, Martijn Faassen wrote:
I was thinking of using a plain python dictionary and implement __hash__ on the xmlNode wrappers to return the int value of the c-xmlNode pointer.
Yes - exactly.
I'm a little confused here - what's an c-xmlNode doing with a pointer to a c-xmlDoc if the node is not attached to the tree? Is there really a way to do this?
I'm honestly not very concerned about the performance of doing the freewalk at this time. I'll just assume that I or someone else is clever enough to solve this problem later. :)
Thanks, I may take you up on that, but for now I just want to get my xml library finished - when I have something that's not embarassing to show - I'll try to get access to lxml SVN. vic

vng1@mac.com wrote:
Is there really a way to do this?
Sure, if you're implementing a W3C DOM API for instance, then this is a regular pattern. You have things like 'createElementNS' on the Document node, which creates an element 'in' the document but not being in the tree yet. In the Python ElementTree API it is possible to create elements 'outside' any document, which is another interesting problem. I tried to solve this by putting them all in their own document at first, and then move them into the other tree whenever they get added the other document. This had some other problems (migrating nodes between documents lead to issue) which I think I've now solved. Anyway, in the libxml2 tree API you could create such a in-document but not-in-tree in a large number of ways. xmlNewDocNode() or xmlCopyNode() for instance.
I'm a bit concerned to have to do a treewalk whenever a node proxy goes out of scope and gets refcounted. This could be easily the case in some loops, I think. I do have some ideas about maintaining a hash table where each C-node id doesn't only reference the node proxy that proxies for it, but also has a reference to all node proxies which are in that subtree. That should be maintainable fairly efficiently as this can be updated by a 'walk up through all my parents' each time a node gets added to the tree or is moved.
Okay, let me know if there's any library available to take a look at. Regards, Martijn

vng1@mac.com wrote: [snip]
Any progress? I won't be able to make it online a lot in the coming three weeks, but after that I hope to have some time to start pushing ahead with lxml development again. Once the memory management problem has been licked, I imagine the rest would be downhill. Just a question of selecting the right APIs to implement (hopefully we can manage not to invent too many new ones). Regards, Martijn

On Thu, 14 Oct 2004 13:37:42 +0200, Martijn Faassen <faassen@infrae.com> wrote:
Speaking of which, I'd love to see APIs for XPath and XSLT transformations. ;-) That's more important to me than mutable DOMs. I'm afraid I'm still largely lost in the libxml2 C APIs, though. -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> Zope Corporation

Fred Drake wrote:
Mutable DOMs are not actually the most important to me too, but I keep getting mired in them anyway. :) I started with the ElementTree API and finish it I will at some point! XPath and XSLT indeed very important to me too; I have explored DOM level 3 XPath API to see whether I could copy those, but they're rather overwhelming. ElementTree-level XPath support might be nice. Anyone know about good XPath or XSLT APIs? I'll go with Pythonic and described in some document but I'll make do with not-Pythonic and described in some document for the time being.
I'm afraid I'm still largely lost in the libxml2 C APIs, though.
I can make XPath and XSLT go from the including Python bindings, but it's not pretty. It will help me getting it to work from the C-side, as the Python side is so similar -- the thing we're trying to fix. :) Regards, Martijn

I think I got it working last night actually. I've been _extremely_ busy with work and personal stuff lately. I just bought a house this week. Code sometimes takes a backseat to 'real life'. :) I'll post more details this weekend, but I've got xmlDoc, xmlNode and basic XPath working with garbage collection.. woot! vic --- Don't be humble ... you're not that great. -- Golda Meir On 14-Oct-04, at 07:37 AM, Martijn Faassen wrote:

On Thu, 14 Oct 2004 11:27:47 -0400, Victor Ng <vng1@mac.com> wrote:
I understand that one. It's hard to deal with code before the new house as networking set up. ;-)
I'll post more details this weekend, but I've got xmlDoc, xmlNode and basic XPath working with garbage collection.. woot!
This is wonderful news! I'll plan on spending some time testing things out too. -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> Zope Corporation

On Fri, 15 Oct 2004 04:14:07 +0200, Martijn Faassen <faassen@infrae.com> wrote:
If you want checkin rights for lxml, check with Philipp (see mail to Victor).
Got those already, just haven't had any time to work on them. -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> Zope Corporation

Victor Ng wrote:
I think I got it working last night actually.
Great!
Well understood; I'll be getting married in about a week's time (the 23rd), and I won't be online much for a while as a result. :)
I'll post more details this weekend, but I've got xmlDoc, xmlNode and basic XPath working with garbage collection.. woot!
Cool! If you want checkin rights for lxml check with Philipp von Weitershausen (philipp@weitershausen.de) and tell him I said okay. And then have your way adding or changing things in lxml if you like, and perhaps I'll have a nice wedding present when I get back in about three weeks time. ;) Regards, Martijn

Hi all, I think the memory management issue is actually pretty easy to solve. Here's the case for an xmlNode: Wrap the c-xmlNode with a Python xmlNode. Make sure that references to a c-xmlNode always have 1 and only 1 xmlNode Python wrapper. Manage this stuff using a flyweight/factory thing where we pass in a c-xmlNode and get back a Python xmlNode wrapper. The factory should keep track of xmlNode's in a weak reference dictionary so that we always have some reference to an xmlNode if anyone else has a reference to it. The xmlNode implements a __dealloc__ method in Pyrex to free the xmlNode when the garbage collector kicks in. If the xmlNode does not belong to any xmlDoc, then we free the c-xmlNode. If the xmlNode belongs to an xmlDoc, then we need to see if any children of the c-xmlDoc have an associated Python wrapper. If there is at least 1 Python wrapper, then we can't free the document. If there are no Python xmlNode wrappers for any c-xmlNode's - then we free each of the c-xmlNodes and then we free the c-xmlDoc. Shouldn't that cover it? I've got a substantial part of this working in my svn repository - I'm just going to tidy it up and I'll release it in the next day or two. Unfortunately - I don't have vic --- Don't be humble ... you're not that great. -- Golda Meir

vng1@mac.com wrote:
The 1 and only 1 requirement does make things significantly simpler, I think. Basically all the factories first look whether the node to be wrapped is already known somewhere else. What kind of datastructure can one use to efficiently check whether we already know a C-level node? I'd prefer to use a hashtable, though a simple implementation could just walk through all nodes in a list/table or something like that.
So you mean a weakref dictionary to node, proxies, not the underlying C nodes, right?
What does 'belonging to an xmlDoc' mean in this case? The situation of a detached node that nonetheless has a pointer to the xmlDoc exists. It's just not attached to the tree, and could safely be removed. There's a problem here if the xmlNode actually has a descendant that does have a proxy pointing to it. You can't deallocate the fragment until *nothing* points to it anymore from Python.
You can generalize this to each fragment; you can have each node in a fragment pointing to the fragment top, which might be the document. Of course there are issues if you move (part of) a fragment to another fragment; all proxies would need to be updated. How do you efficiently check whether there are no more proxies pointing to C-level nodes? A complete freewalk each time a local proxy goes out of scope would be bad.
Shouldn't that cover it?
If you can answer my questions, yes. :) If we had a fast way to check whether any given subtree has any proxies pointing to this then this algorithm sounds feasible.
Would you like access to the lxml svn? If so, please drop me a mail and I'll get you in touch with Philipp von Weitershausen who will be able to give you access. Regards, Martijn

On 1-Oct-04, at 02:32 PM, Martijn Faassen wrote:
I was thinking of using a plain python dictionary and implement __hash__ on the xmlNode wrappers to return the int value of the c-xmlNode pointer.
Yes - exactly.
I'm a little confused here - what's an c-xmlNode doing with a pointer to a c-xmlDoc if the node is not attached to the tree? Is there really a way to do this?
I'm honestly not very concerned about the performance of doing the freewalk at this time. I'll just assume that I or someone else is clever enough to solve this problem later. :)
Thanks, I may take you up on that, but for now I just want to get my xml library finished - when I have something that's not embarassing to show - I'll try to get access to lxml SVN. vic

vng1@mac.com wrote:
Is there really a way to do this?
Sure, if you're implementing a W3C DOM API for instance, then this is a regular pattern. You have things like 'createElementNS' on the Document node, which creates an element 'in' the document but not being in the tree yet. In the Python ElementTree API it is possible to create elements 'outside' any document, which is another interesting problem. I tried to solve this by putting them all in their own document at first, and then move them into the other tree whenever they get added the other document. This had some other problems (migrating nodes between documents lead to issue) which I think I've now solved. Anyway, in the libxml2 tree API you could create such a in-document but not-in-tree in a large number of ways. xmlNewDocNode() or xmlCopyNode() for instance.
I'm a bit concerned to have to do a treewalk whenever a node proxy goes out of scope and gets refcounted. This could be easily the case in some loops, I think. I do have some ideas about maintaining a hash table where each C-node id doesn't only reference the node proxy that proxies for it, but also has a reference to all node proxies which are in that subtree. That should be maintainable fairly efficiently as this can be updated by a 'walk up through all my parents' each time a node gets added to the tree or is moved.
Okay, let me know if there's any library available to take a look at. Regards, Martijn

vng1@mac.com wrote: [snip]
Any progress? I won't be able to make it online a lot in the coming three weeks, but after that I hope to have some time to start pushing ahead with lxml development again. Once the memory management problem has been licked, I imagine the rest would be downhill. Just a question of selecting the right APIs to implement (hopefully we can manage not to invent too many new ones). Regards, Martijn

On Thu, 14 Oct 2004 13:37:42 +0200, Martijn Faassen <faassen@infrae.com> wrote:
Speaking of which, I'd love to see APIs for XPath and XSLT transformations. ;-) That's more important to me than mutable DOMs. I'm afraid I'm still largely lost in the libxml2 C APIs, though. -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> Zope Corporation

Fred Drake wrote:
Mutable DOMs are not actually the most important to me too, but I keep getting mired in them anyway. :) I started with the ElementTree API and finish it I will at some point! XPath and XSLT indeed very important to me too; I have explored DOM level 3 XPath API to see whether I could copy those, but they're rather overwhelming. ElementTree-level XPath support might be nice. Anyone know about good XPath or XSLT APIs? I'll go with Pythonic and described in some document but I'll make do with not-Pythonic and described in some document for the time being.
I'm afraid I'm still largely lost in the libxml2 C APIs, though.
I can make XPath and XSLT go from the including Python bindings, but it's not pretty. It will help me getting it to work from the C-side, as the Python side is so similar -- the thing we're trying to fix. :) Regards, Martijn

I think I got it working last night actually. I've been _extremely_ busy with work and personal stuff lately. I just bought a house this week. Code sometimes takes a backseat to 'real life'. :) I'll post more details this weekend, but I've got xmlDoc, xmlNode and basic XPath working with garbage collection.. woot! vic --- Don't be humble ... you're not that great. -- Golda Meir On 14-Oct-04, at 07:37 AM, Martijn Faassen wrote:

On Thu, 14 Oct 2004 11:27:47 -0400, Victor Ng <vng1@mac.com> wrote:
I understand that one. It's hard to deal with code before the new house as networking set up. ;-)
I'll post more details this weekend, but I've got xmlDoc, xmlNode and basic XPath working with garbage collection.. woot!
This is wonderful news! I'll plan on spending some time testing things out too. -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> Zope Corporation

On Fri, 15 Oct 2004 04:14:07 +0200, Martijn Faassen <faassen@infrae.com> wrote:
If you want checkin rights for lxml, check with Philipp (see mail to Victor).
Got those already, just haven't had any time to work on them. -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> Zope Corporation

Victor Ng wrote:
I think I got it working last night actually.
Great!
Well understood; I'll be getting married in about a week's time (the 23rd), and I won't be online much for a while as a result. :)
I'll post more details this weekend, but I've got xmlDoc, xmlNode and basic XPath working with garbage collection.. woot!
Cool! If you want checkin rights for lxml check with Philipp von Weitershausen (philipp@weitershausen.de) and tell him I said okay. And then have your way adding or changing things in lxml if you like, and perhaps I'll have a nice wedding present when I get back in about three weeks time. ;) Regards, Martijn
participants (4)
-
Fred Drake
-
Martijn Faassen
-
Victor Ng
-
vng1@mac.com