Mailman 3 [lxml-dev] memory management strategies - lxml - The Python XML Toolkit

vng1＠mac.com

October 2004

5:08 p.m.

Hi all,

I think the memory management issue is actually pretty easy to solve.

Here's the case for an xmlNode:

Wrap the c-xmlNode with a Python xmlNode.

Make sure that references to a c-xmlNode always have 1 and only 1 xmlNode Python wrapper. Manage this stuff using a flyweight/factory thing where we pass in a c-xmlNode and get back a Python xmlNode wrapper.

The factory should keep track of xmlNode's in a weak reference dictionary so that we always have some reference to an xmlNode if anyone else has a reference to it.

The xmlNode implements a __dealloc__ method in Pyrex to free the xmlNode when the garbage collector kicks in.

If the xmlNode does not belong to any xmlDoc, then we free the c-xmlNode.

If the xmlNode belongs to an xmlDoc, then we need to see if any children of the c-xmlDoc have an associated Python wrapper. If there is at least 1 Python wrapper, then we can't free the document. If there are no Python xmlNode wrappers for any c-xmlNode's - then we free each of the c-xmlNodes and then we free the c-xmlDoc.

Shouldn't that cover it?

I've got a substantial part of this working in my svn repository - I'm just going to tidy it up and I'll release it in the next day or two. Unfortunately - I don't have

vic

--- Don't be humble ... you're not that great. -- Golda Meir


Martijn Faassen

6:32 p.m.

vng1@mac.com wrote:

...

I think the memory management issue is actually pretty easy to solve.

Here's the case for an xmlNode:

Wrap the c-xmlNode with a Python xmlNode.

Make sure that references to a c-xmlNode always have 1 and only 1 xmlNode Python wrapper. Manage this stuff using a flyweight/factory thing where we pass in a c-xmlNode and get back a Python xmlNode wrapper.

The 1 and only 1 requirement does make things significantly simpler, I think. Basically all the factories first check whether the node to be wrapped is already known somewhere else. What kind of data structure can one use to efficiently check whether we already know a C-level node? I'd prefer to use a hashtable, though a simple implementation could just walk through all nodes in a list/table or something like that.

...

The factory should keep track of xmlNode's in a weak reference dictionary so that we always have some reference to an xmlNode if anyone else has a reference to it.

So you mean a weakref dictionary of node proxies, not the underlying C nodes, right?
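The flyweight/factory idea discussed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from either repository: `CNode`, `NodeProxy` and `ProxyFactory` are hypothetical names, `CNode` stands in for a C-level xmlNode, and `id()` stands in for the pointer value a real binding would use as the hash key.

```python
import weakref


class CNode:
    """Hypothetical stand-in for a C-level xmlNode."""

    def __init__(self, name):
        self.name = name


class NodeProxy:
    """Hypothetical Python wrapper around exactly one CNode."""

    def __init__(self, c_node):
        self.c_node = c_node


class ProxyFactory:
    """Hands out at most one live NodeProxy per CNode."""

    def __init__(self):
        # Keys are node identities; values are held weakly, so an entry
        # disappears once the last outside reference to its proxy is gone.
        self._proxies = weakref.WeakValueDictionary()

    def wrap(self, c_node):
        key = id(c_node)
        proxy = self._proxies.get(key)
        if proxy is None:
            # No live proxy for this node yet: create and register one.
            proxy = NodeProxy(c_node)
            self._proxies[key] = proxy
        return proxy

    def has_proxy(self, c_node):
        # Hash lookup instead of walking a list of all known nodes.
        return id(c_node) in self._proxies
```

Wrapping the same node twice returns the same proxy object, and the dictionary holds proxies rather than C nodes, which is the reading suggested in the question above.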

...

The xmlNode implements a __dealloc__ method in Pyrex to free the xmlNode when the garbage collector kicks in.

If the xmlNode does not belong to any xmlDoc, then we free the c-xmlNode.

What does 'belonging to an xmlDoc' mean in this case? The situation of a detached node that nonetheless has a pointer to the xmlDoc exists. It's just not attached to the tree, and could safely be removed. There's a problem here if the xmlNode actually has a descendant that does have a proxy pointing to it. You can't deallocate the fragment until *nothing* points to it anymore from Python.

...

If the xmlNode belongs to an xmlDoc, then we need to see if any children of the c-xmlDoc have an associated Python wrapper. If there is at least 1 Python wrapper, then we can't free the document. If there are no Python xmlNode wrappers for any c-xmlNode's - then we free each of the c-xmlNodes and then we free the c-xmlDoc.

You can generalize this to each fragment; you can have each node in a fragment pointing to the fragment top, which might be the document. Of course there are issues if you move (part of) a fragment to another fragment; all proxies would need to be updated. How do you efficiently check whether there are no more proxies pointing to C-level nodes? A complete treewalk each time a local proxy goes out of scope would be bad.
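To make the cost concrete, here is a minimal sketch of the naive check being questioned: walk up to the fragment top, then walk the whole fragment looking for any node that still has a proxy. The `CNode` class and its `has_proxy` flag are hypothetical stand-ins for illustration; a real binding would consult its proxy table instead.

```python
class CNode:
    """Hypothetical stand-in for a C-level node with tree links."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []
        self.has_proxy = False  # assumed flag: does a Python proxy exist?

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child


def fragment_top(node):
    """Follow parent links to the top of the fragment (maybe the document)."""
    while node.parent is not None:
        node = node.parent
    return node


def any_proxy_in_subtree(node):
    """Full treewalk: visits every node in the worst case."""
    stack = [node]
    while stack:
        current = stack.pop()
        if current.has_proxy:
            return True
        stack.extend(current.children)
    return False


def can_free_fragment(node):
    """A fragment may be freed only when nothing in it is proxied."""
    return not any_proxy_in_subtree(fragment_top(node))
```

Each call is linear in the size of the fragment, so running it every time a proxy is released inside a loop is where the performance worry comes from.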

...

Shouldn't that cover it?

If you can answer my questions, yes. :) If we had a fast way to check whether any given subtree has any proxies pointing into it, then this algorithm sounds feasible.

...

I've got a substantial part of this working in my svn repository - I'm just going to tidy it up and I'll release it in the next day or two. Unfortunately - I don't have

Would you like access to the lxml svn? If so, please drop me a mail and I'll get you in touch with Philipp von Weitershausen who will be able to give you access. Regards, Martijn


Martijn Faassen

5:43 p.m.

vng1@mac.com wrote:

...

On 1-Oct-04, at 02:32 PM, Martijn Faassen wrote:

...
...
The xmlNode implements a __dealloc__ method in Pyrex to free the xmlNode when the garbage collector kicks in. If the xmlNode does not belong to any xmlDoc, then we free the c-xmlNode.

What does 'belonging to an xmlDoc' mean in this case? The situation of a detached node that nonetheless has a pointer to the xmlDoc exists. It's just not attached to the tree, and could safely be removed.

I'm a little confused here - what's a c-xmlNode doing with a pointer to a c-xmlDoc if the node is not attached to the tree?

...

Is there really a way to do this?

Sure, if you're implementing a W3C DOM API, for instance, then this is a regular pattern. You have things like 'createElementNS' on the Document node, which creates an element 'in' the document that is not in the tree yet. In the Python ElementTree API it is possible to create elements 'outside' any document, which is another interesting problem. I tried to solve this by putting them all in their own document at first, and then moving them into the other tree whenever they get added to the other document. This had some other problems (migrating nodes between documents led to issues) which I think I've now solved. Anyway, in the libxml2 tree API you could create such an in-document but not-in-tree node in a large number of ways: xmlNewDocNode() or xmlCopyNode(), for instance.
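The W3C DOM pattern described here can be observed with Python's standard-library `xml.dom.minidom`. Note that this is a different implementation from libxml2 and is used only to illustrate the "in the document but not in the tree" state; the element and variable names are made up for the example.

```python
from xml.dom.minidom import getDOMImplementation

# Create a document with a root element.
doc = getDOMImplementation().createDocument(None, "root", None)

# createElementNS makes an element owned by the document...
orphan = doc.createElementNS(None, "item")
in_document = orphan.ownerDocument is doc

# ...but it has no parent until it is explicitly inserted.
in_tree = orphan.parentNode is not None

# Only appendChild links it into the tree.
doc.documentElement.appendChild(orphan)
attached = orphan.parentNode is doc.documentElement
```

Before the `appendChild` call the element points at its document without being reachable from the document's root, which is the case a deallocation scheme has to handle.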

...

...
...
If the xmlNode belongs to an xmlDoc, then we need to see if any children of the c-xmlDoc have an associated Python wrapper. If there is at least 1 Python wrapper, then we can't free the document. If there are no Python xmlNode wrappers for any c-xmlNode's - then we free each of the c-xmlNodes and then we free the c-xmlDoc.

You can generalize this to each fragment; you can have each node in a fragment pointing to the fragment top, which might be the document. Of course there are issues if you move (part of) a fragment to another fragment; all proxies would need to be updated.

How do you efficiently check whether there are no more proxies pointing to C-level nodes? A complete treewalk each time a local proxy goes out of scope would be bad.

I'm honestly not very concerned about the performance of doing the treewalk at this time. I'll just assume that I or someone else is clever enough to solve this problem later. :)

I'm a bit concerned about having to do a treewalk whenever a node proxy goes out of scope and its refcount drops to zero. This could easily be the case in some loops, I think. I do have some ideas about maintaining a hash table where each C-node id references not only the node proxy that proxies for it, but also all node proxies which are in that subtree. That should be maintainable fairly efficiently, as it can be updated by a 'walk up through all my parents' each time a node gets added to the tree or is moved.
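One way to read the 'walk up through all my parents' idea is sketched below. It is a simplification and an assumption on my part: instead of a hash table holding references to every proxy in a subtree, each node keeps a count of proxies in its subtree, adjusted along the ancestor chain. All names (`CNode`, `proxy_created`, `attach`, and so on) are hypothetical.

```python
class CNode:
    """Hypothetical stand-in for a C-level node."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []
        self.proxies_in_subtree = 0  # proxies on this node or below it


def _adjust_ancestors(node, delta):
    """Walk up through all parents, adjusting each subtree count."""
    while node is not None:
        node.proxies_in_subtree += delta
        node = node.parent


def proxy_created(node):
    _adjust_ancestors(node, +1)


def proxy_released(node):
    _adjust_ancestors(node, -1)


def attach(parent, child):
    """Attach a detached fragment top and propagate its proxy count."""
    if child.parent is not None:
        raise ValueError("child is already attached")
    child.parent = parent
    parent.children.append(child)
    _adjust_ancestors(parent, child.proxies_in_subtree)


def detach(child):
    """Detach a subtree and remove its proxy count from former ancestors."""
    parent = child.parent
    if parent is None:
        return
    parent.children.remove(child)
    child.parent = None
    _adjust_ancestors(parent, -child.proxies_in_subtree)


def subtree_has_proxies(node):
    """Constant-time replacement for the full treewalk."""
    return node.proxies_in_subtree > 0
```

Each update costs time proportional to the depth of the node rather than the size of the tree, and the "can this fragment be freed?" question becomes a single comparison on the fragment top.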

...

...
Would you like access to the lxml svn? If so, please drop me a mail and I'll get you in touch with Philipp von Weitershausen who will be able to give you access.

Thanks, I may take you up on that, but for now I just want to get my xml library finished - when I have something that's not embarrassing to show - I'll try to get access to lxml SVN.

Okay, let me know if there's any library available to take a look at. Regards, Martijn


Fred Drake

1:37 p.m.

On Thu, 14 Oct 2004 13:37:42 +0200, Martijn Faassen <faassen@infrae.com> wrote:

...

Speaking of which, I'd love to see APIs for XPath and XSLT transformations. ;-) That's more important to me than mutable DOMs. I'm afraid I'm still largely lost in the libxml2 C APIs, though.

-Fred

-- Fred L. Drake, Jr. <fdrake at gmail.com> Zope Corporation


Victor Ng

3:27 p.m.

I think I got it working last night actually.

I've been _extremely_ busy with work and personal stuff lately. I just bought a house this week. Code sometimes takes a backseat to 'real life'. :)

I'll post more details this weekend, but I've got xmlDoc, xmlNode and basic XPath working with garbage collection.. woot!

vic

--- Don't be humble ... you're not that great. -- Golda Meir

On 14-Oct-04, at 07:37 AM, Martijn Faassen wrote:

...


Martijn Faassen

2:14 a.m.

Fred Drake wrote:

...

If you want checkin rights for lxml, check with Philipp (see mail to Victor). I'll trust you guys to make something nice out of it. :) Regards, Martijn


Martijn Faassen

2:09 a.m.

Victor Ng wrote:

...

I think I got it working last night actually.

Great!

...

I've been _extremely_ busy with work and personal stuff lately. I just bought a house this week. Code sometimes takes a backseat to 'real life'. :)

Well understood; I'll be getting married in about a week's time (the 23rd), and I won't be online much for a while as a result. :)

...

I'll post more details this weekend, but I've got xmlDoc, xmlNode and basic XPath working with garbage collection.. woot!

Cool! If you want checkin rights for lxml, check with Philipp von Weitershausen (philipp@weitershausen.de) and tell him I said okay. And then have your way adding or changing things in lxml if you like, and perhaps I'll have a nice wedding present when I get back in about three weeks' time. ;) Regards, Martijn


