[lxml-dev] Memory management redux

Hi there,

Inspired by discussions with Vic and through browsing the vlibxml2 code, I've implemented a bit of memory management functionality which, after a lot of manual debugging, seems to be doing the right thing. So far...

A new addition to the lxml trunk is nodereg, and associated testing stuff (noderegtest.pyx and test_nodereg.py). nodereg is a system for registering Python-level node proxies, plus some base classes for the document and node objects in a typical libxml2 tree wrapper.

The nodereg module functionality can be used to make sure that memory (in particular libxml2 tree nodes) gets collected when it is possible, and not before. :) This sounds easy, but it is surprisingly tricky.

Next:

* Look into hooking in libxml2's memory debugging functionality for testing. Investigate Vic's code in that area/get Vic's advice.

* Start rewriting etree, dom, or vlibxml2 to use nodereg. This will likely further evolve nodereg.

* Add more functionality to nodereg. One thing that currently is not handled is attribute nodes, for instance.

* Optimize nodereg. The strategy currently employed requires, in the worst case, a lot of full-tree walks to determine whether a node in the tree can be successfully garbage collected. We need to come up with some smart algorithm/data structure to avoid this having to happen too often.

Another thing I would really like to do is investigate adding weakref support to Pyrex. Right now I had to first jump through a bit of a hoop to make it work. Then later on I took a long time debugging an obscure case where there would be a remaining refcount on an object even if the only object still pointing to it was a WeakValueDictionary. I finally traced it down to Pyrex introducing this. I'm not clear why, but somehow the base class got involved (which was not weakreferenceable as defined by Pyrex). This somehow managed to trick the object into keeping a reference while it shouldn't, causing it never to be deallocated.

Being able to just say 'this class can be weakreferenced' in Pyrex should make this go away.

Regards,

Martijn
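For readers following along, here is a minimal pure-Python sketch of the general idea of a proxy registry backed by a WeakValueDictionary. It is not the nodereg code: NodeProxy, ProxyRegistry, and the integer node ids are all hypothetical names made up for illustration. It also shows the plain-Python analogue of "this class can be weakreferenced": a class that defines __slots__ has to list '__weakref__' explicitly, otherwise weakref refuses it.

```python
import gc
import weakref


class NodeProxy:
    # Hypothetical proxy class. Listing '__weakref__' in __slots__ is what
    # makes instances of a slotted class weakly referenceable.
    __slots__ = ("node_id", "__weakref__")

    def __init__(self, node_id):
        self.node_id = node_id


class ProxyRegistry:
    # Hypothetical registry: maps a node id to its live proxy, holding the
    # proxy only weakly so the registry itself never keeps it alive.
    def __init__(self):
        self._proxies = weakref.WeakValueDictionary()

    def get_proxy(self, node_id):
        proxy = self._proxies.get(node_id)
        if proxy is None:
            proxy = NodeProxy(node_id)
            self._proxies[node_id] = proxy
        return proxy

    def has_proxy(self, node_id):
        return node_id in self._proxies


registry = ProxyRegistry()
p1 = registry.get_proxy(1)
p2 = registry.get_proxy(1)
same_proxy = p1 is p2              # the registry hands back the same proxy
alive_before = registry.has_proxy(1)

del p1, p2                         # drop the last strong references
gc.collect()                       # not needed on CPython, harmless elsewhere
alive_after = registry.has_proxy(1)


class NoWeakref:
    # Slotted class without '__weakref__': cannot be weakly referenced.
    __slots__ = ("node_id",)


try:
    weakref.ref(NoWeakref())
    weakref_failed = False
except TypeError:
    weakref_failed = True
```

Once the registry entry disappears, a wrapper would know that no Python code can still reach that node through a proxy. The stray-refcount problem described above is exactly the situation where alive_after would stay true even though nothing but the dictionary points at the proxy.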

Hi there,

Replying to myself, for the undoubtedly vast crowd who is checking up on the progress every minute.. :)

Martijn Faassen wrote:
[snip]
Next:
[snip]
* start rewriting etree, dom, or vlibxml2 to use nodereg. This will likely further evolve nodereg.
I've adapted dom.pyx to use the nodereg system. This indeed flushed out a few more bugs in nodereg that I tracked down and killed. Nodereg has also evolved to account for the case where there are two different Python-level proxies that both apply to the same underlying C node. This is possible for instance in the case of DOM's .attributes and .children attributes, and a few Python iterators that were defined on it as well. dom.pyx is starting to grow a few DOM manipulation methods -- they're far from done yet and there are far more, but it's a start, and helps with memory debugging. I'll likely start looking into porting over etree onto the new system as well.
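As an illustration of the "several proxies for one C node" case, here is a small pure-Python sketch. It is an assumption about how such bookkeeping could look, not the actual nodereg implementation: Proxy, MultiProxyRegistry, the kind labels, and the integer node id are all hypothetical.

```python
import gc
import weakref


class Proxy:
    # Hypothetical proxy: 'kind' distinguishes e.g. an attributes view
    # from a children view onto the same underlying node.
    def __init__(self, node_id, kind):
        self.node_id = node_id
        self.kind = kind


class MultiProxyRegistry:
    # Hypothetical registry: each node id maps to a WeakSet of proxies,
    # so any number of proxies can refer to one node without being
    # kept alive by the registry.
    def __init__(self):
        self._by_node = {}

    def register(self, proxy):
        self._by_node.setdefault(proxy.node_id, weakref.WeakSet()).add(proxy)

    def live_count(self, node_id):
        proxies = self._by_node.get(node_id)
        return len(proxies) if proxies is not None else 0


registry = MultiProxyRegistry()
attrs = Proxy(7, "attributes")
children = Proxy(7, "children")
registry.register(attrs)
registry.register(children)
count_both = registry.live_count(7)

del attrs
gc.collect()
count_one = registry.live_count(7)

del children
gc.collect()
count_none = registry.live_count(7)
```

In this sketch the node only becomes a candidate for freeing once its live count reaches zero, regardless of which proxy went away first.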
* Add more functionality to nodereg. One thing that currently is not handled is attribute nodes, for instance.
This has now also been fixed; attributes should be properly supported.
On optimizing nodereg: I'll see about this if it starts bothering me (or other people). Worrying too much about this now (which I had been doing) is probably premature optimization; thanks Vic for giving me that insight when you said you weren't worrying about this. :) I'm hopeful that the worst case (full tree scans happening a lot) is actually not happening as often as I feared in common usage cases.
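To make the worst case concrete, here is a rough pure-Python sketch of a full-tree check of the kind discussed. This is only one possible reading of the strategy, written under assumptions: Node, find_root, tree_has_proxy, can_free_tree, and the "proxied" set are hypothetical stand-ins, and the real code works on libxml2 nodes rather than Python objects.

```python
class Node:
    # Hypothetical stand-in for a tree node with parent/children links.
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def find_root(node):
    # Walk up the parent links to the top of the tree.
    while node.parent is not None:
        node = node.parent
    return node


def tree_has_proxy(root, proxied):
    # Full walk: visit every node until one with a live proxy is found.
    stack = [root]
    while stack:
        node = stack.pop()
        if node in proxied:
            return True
        stack.extend(node.children)
    return False


def can_free_tree(node, proxied):
    # The tree is only safe to free when no node in it is still proxied.
    return not tree_has_proxy(find_root(node), proxied)


root = Node("root")
a = Node("a", root)
b = Node("b", a)

proxied = {b}                       # pretend Python still holds a proxy for b
blocked = can_free_tree(root, proxied)

proxied.clear()                     # the last proxy has gone away
free_ok = can_free_tree(root, proxied)
```

Every call walks the whole tree when nothing is proxied, which is the cost being weighed above; a per-tree counter of live proxies would be one obvious way to avoid the walk, at the price of more bookkeeping.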
On weakref support in Pyrex: after being frustrated again by the lack of it when I had to reorganize dom.pyx, I've since posted on the Pyrex list noting my desire for weakref support.

Regards,

Martijn

Participants (1): Martijn Faassen