[lxml-dev] Temporary data attached to custom subclasses

I've been using the custom subclasses capability of lxml. It's slick.

I do, however, miss the ability to attach temporary data to the ElementBase subclasses (see the warnings under "Element initialization" at http://codespeak.net/lxml/element_classes.html). I can, as suggested by the docs, add attributes or children to the underlying etree.Element, but that means I'd have to strip that temporary data off when I want to serialize the tree. (Please stop me if you've already heard this request, or if there is another solution.)

I'd have a solution (see below) to this need if I could get a value, say an ID, (1) that is unique to each node and (2) that does not change during the existence of the ElementTree. Note that this "ID" does not have to be meaningful, and does not need to enable me to do anything with the underlying XML object (other than re-identify it).

If I could get this opaque ID (or whatever it might be called), then I could use a dictionary and something like the following to store and retrieve temporary data::

    from lxml import etree

    Datadict1 = {}

    def get_temp_data(node, datadict):
        # get_opaque_id() is the method proposed in this message;
        # it does not exist in lxml.
        node_id = node.get_opaque_id()
        if node_id in datadict:
            return datadict[node_id]
        else:
            data = {}
            datadict[node_id] = data
            return data

    def test():
        doc = etree.parse('somedoc.xml')
        root = doc.getroot()
        node = root[0]
        data = get_temp_data(node, Datadict1)
        value1 = 'some temporary data'
        data['key1'] = value1
        # o o o
        data = get_temp_data(node, Datadict1)
        print(data['key1'])

    test()

Looking at lxml-2.2.4/src/lxml/lxml.etree.pyx, it seems like that would be a trivial function to add (see below).

What do you think? It's a pretty simple solution. Has it been tried or rejected already?

Here is a patch that seems to add the necessary function. This function returns the C pointer to the libxml2 object that is underneath the lxml/etree object. Am I right that this value would be (1) unique and (2) persistent across the lifetime of the lxml/etree ElementTree?
Index: lxml.etree.pyx
===================================================================
--- lxml.etree.pyx (revision 71999)
+++ lxml.etree.pyx (working copy)
@@ -1185,6 +1185,21 @@
             return None
         return _elementFactory(self._doc, c_node)
 
+    def getopaqueid(self):
+        u"""getopaqueid(self)
+
+        Returns an opaque ID for the underlying XML C node. This
+        opaque ID is guaranteed (1) to be unique to each node
+        and (2) not to change during the existence of the
+        ElementTree.
+        """
+        cdef xmlNode* c_node
+        cdef int intnode
+        c_node = self._c_node
+        intnode = <int>c_node
+        opaqueid = intnode
+        return opaqueid
+
     def getnext(self):
         u"""getnext(self)

- Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman

Dave Kuhlman, 29.03.2010 23:48:
As long as your tree doesn't change, the easiest solution is to keep a reference to all Elements ("list(root.iter())") and then just store the data in the proxy instances. They are guaranteed not to change as long as there is a live reference to them. If your tree changes, you can still try to add new Elements to your keep-alive list to get the same behaviour, but you may need to take a little more care when you remove elements, so that you only remove them from the keep-alive list when you are sure they'll get discarded.
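The keep-alive approach described above can be sketched as follows. This is only an illustration, assuming a custom class registered through `ElementDefaultClassLookup`; the class name `NoteElement`, the attribute name `temp_data`, and the sample document are made up for the example.

```python
from lxml import etree


class NoteElement(etree.ElementBase):
    """Hypothetical custom element class; it defines no __init__."""


# Make the parser hand out NoteElement proxies for all elements.
parser = etree.XMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=NoteElement))

root = etree.fromstring("<root><a/><b/></root>", parser)

# Hold a reference to every proxy so that none of them is discarded
# and recreated while we work with the tree.
keep_alive = list(root.iter())

# Store temporary data as a plain Python attribute on the proxy.
root[0].temp_data = {"key1": "some temporary data"}

# A later lookup returns the same live proxy, so the data is still there.
print(root[0].temp_data["key1"])

# The attribute lives only on the proxy; it is not part of the XML.
print(etree.tostring(root))
```

Without the `keep_alive` list, the proxy returned by the first `root[0]` could be freed as soon as the statement finishes, and the next `root[0]` would then return a new proxy without the attribute.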
I usually suggest using the generated XPath of the element:

http://codespeak.net/lxml/xpathxslt.html#generating-xpath-expressions

But that's certainly more expensive than just returning a Py_ssize_t value.

Stefan
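As a rough sketch of the generated-XPath idea, the dictionary from the original message could be keyed on `ElementTree.getpath()` instead of an opaque ID. The helper name `get_temp_data`, the `temp_data` dictionary, and the sample document are assumptions made for this example; note that a generated path only identifies a node for as long as the tree structure stays the same.

```python
from lxml import etree

root = etree.fromstring("<root><item/><item/><other/></root>")
tree = root.getroottree()

# Maps a generated XPath expression to the temporary data of that element.
temp_data = {}


def get_temp_data(node):
    """Return the scratch dict for node, creating it on first use."""
    path = tree.getpath(node)
    return temp_data.setdefault(path, {})


get_temp_data(root[1])["key1"] = "some temporary data"

# A later call for the same element finds the same dict again,
# even if the Python proxy object was recreated in between.
print(get_temp_data(root[1])["key1"])
```

Nothing is written to the tree here, so serialization is unaffected, but every lookup pays for building the path string.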

Stefan -

Thanks for this suggestion. The keep-alive list/set seems like a good solution for my needs.

Another point about this: the documentation you point at has the following in the section titled "Element initialization":

    "There is one thing to know up front. Element classes must not have an __init__ or __new__ method. There should not be any internal state either, except for the data stored in the underlying XML tree."

The above suggests that there is no solution such as the one you suggest. And so, someone like me, with a little less brain-power, is unlikely to think of that solution. You might want to add your two paragraphs (above) or something like the following:

    "If you really must store temporary data on an element that you do not want serialized, then you should put any nodes which must be persistent on a keep-alive list (or other container), since they are guaranteed not to change as long as there is a live reference to them."

Something like that might save you from having to answer this question yet again at some time in the future.

And, a last point: for some purposes, instead of:

    keep_alive = list(root.iter())

the following might be better:

    keep_alive = set(root.iterdescendants())
    keep_alive.add(root)

because:

1. iterdescendants() plus adding root puts all nodes into keep_alive.

2. A set should give faster look-up, check for membership, etc.

Thanks again for your help with this. And, thanks even more for Lxml. It's a super tool.

- Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman

Dave Kuhlman, 01.04.2010 22:41:
Thanks, I'll add something like that to the docs.
Then that shouldn't be any different from

    keep_alive = set(root.iter())

The only reason why there *is* an iterdescendants() is that iter() yields all nodes in the subtree, including the root itself.

Stefan
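The difference between the two iterators can be checked with a small sketch like the one below; the sample document is made up for the illustration.

```python
from lxml import etree

root = etree.fromstring("<root><a><b/></a><c/></root>")

# iter() yields the root itself plus everything below it.
all_nodes = set(root.iter())

# iterdescendants() yields only the nodes below the root.
descendants = set(root.iterdescendants())

print(root in all_nodes)
print(root in descendants)

# Adding the root to the descendants gives the same set as iter().
print(all_nodes == descendants | {root})
```

Both sets hold references to the same proxy objects, which is why they can be compared directly.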

On Thu, Apr 01, 2010 at 11:14:08PM +0200, Stefan Behnel wrote:
Stefan -

You are right. My mistake. I thought I had done a test with iter(), but I must have confused myself somehow.

- Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman
