[lxml-dev] the reason for _fakeRootDoc?
Hi there, I'm trying to understand why in scoder2 there's the use of _fakeRootDoc in a few places. I imagine this is to work around some bugs that show up in the trunk, but I don't have enough context on what and why. Stefan, can you enlighten me? Any tests that expose this behavior in particular? Regards, Martijn
Martijn Faassen wrote:
I'm trying to understand why in scoder2 there's the use of _fakeRootDoc in a few places. I imagine this is to work around some bugs that show up in the trunk, but I don't have enough context on what and why. Stefan, can you enlighten me? Any tests that expose this behavior in particular?
It's related to the fact that the current trunk rips Elements out of their document when you wrap them in ElementTrees. This is both incompatible to ElementTree und (IMHO) unnecessary. So, what I did was: rewrite ElementTree as a light-weight wrapper around an Element and remove its internal identity with libxml2 documents. In scoder2, Elements store their own reference to their 'Document', which is the new lxml-internal Python representation of a libxml2 document. This means that Elements can now stay in their document while being wrapped in any number of ElementTrees, as ElementTrees only refers to them as a 'virtual' root within their document. The drawback is that libxml2 actions that run on ElementTrees now cannot know by themselves which element is the actual root of the ElementTree (which is called _context_node in _ElementTree), as it is no longer the root of the document. This is what _fakeRootDoc is for. It takes an arbitrary element and its document and then temporarily creates a fake copy of the document that has a copy of the element as root. So, while the original element is still in its previously (arbitrary) position, its copy now forms the root of the fake document. (It is necessary to copy it since it must not have any siblings.) libxml2/libxslt actions are then run against that fake document, which is cleaned up (i.e. deleted) immediately after the action. One little hack is that it only copies the element, it does not do deep-copying. It should therefore be possible to access higher ancestors that are not below the _context_node of the respective ElementTree via e.g. an XSLT (like via <copy-of select="ancestors::node()"/>). I didn't test that, but I believe it is not too bad, since it is a rather rare case and somehow even fits the ElementTree idea of Elements being 'free'. I know, it might sound a little confusing at first, but I hope you got the idea. Stefan
Stefan Behnel wrote:
Martijn Faassen wrote:
I'm trying to understand why in scoder2 there's the use of _fakeRootDoc in a few places. I imagine this is to work around some bugs that show up in the trunk, but I don't have enough context on what and why. Stefan, can you enlighten me? Any tests that expose this behavior in particular?
[snip]
The drawback is that libxml2 actions that run on ElementTrees now cannot know by themselves which element is the actual root of the ElementTree (which is called _context_node in _ElementTree), as it is no longer the root of the document.
Right, thanks for pointing out this consequence.
This is what _fakeRootDoc is for. It takes an arbitrary element and its document and then temporarily creates a fake copy of the document that has a copy of the element as root. So, while the original element is still in its previously (arbitrary) position, its copy now forms the root of the fake document. (It is necessary to copy it since it must not have any siblings.) libxml2/libxslt actions are then run against that fake document, which is cleaned up (i.e. deleted) immediately after the action.
One little hack is that it only copies the element, it does not do deep-copying. It should therefore be possible to access higher ancestors that are not below the _context_node of the respective ElementTree via e.g. an XSLT (like via <copy-of select="ancestors::node()"/>). I didn't test that, but I believe it is not too bad, since it is a rather rare case and somehow even fits the ElementTree idea of Elements being 'free'.
Hm, this is rather worrisome. I would expect an ElementTree to be a whole document and this could potentially lead to odd effects depending on the way the XSLT engine works -- basically due to this hack, the operation is performed on a tree that is not actually correct and this could lead to all kinds of unexpected behavior... Additionally, how does the getparent() operation fare in this context? The parent of a node that's wrapped directly in an ElementTree might be something that's not even supposed to be in that tree, which is really unexpected. Or is there a reason this doesn't happen?
I know, it might sound a little confusing at first, but I hope you got the idea.
Yes, I think so, thank you for the explanation. I now need to consider costs and gains of this approach compared to the one lxml is taking currently... Regards, Martijn
Martijn Faassen wrote:
Stefan Behnel wrote:
One little hack is that it only copies the element, it does not do deep-copying. It should therefore be possible to access higher ancestors that are not below the _context_node of the respective ElementTree via e.g. an XSLT (like via <copy-of select="ancestors::node()"/>). I didn't test that, but I believe it is not too bad, since it is a rather rare case and somehow even fits the ElementTree idea of Elements being 'free'.
Hm, this is rather worrisome. I would expect an ElementTree to be a whole document and this could potentially lead to odd effects depending on the way the XSLT engine works -- basically due to this hack, the operation is performed on a tree that is not actually correct and this could lead to all kinds of unexpected behavior...
I find the behaviour of ElementTrees eating Elements much more surprising, especially for people knowing the original ElementTree. Note that it is perfectly reasonable for the root element of an ElementTree to have a parent. You must not forget that it is now possible for elements to be in multiple ElementTrees at the same time, just like in Fredrik's ElementTree. So, once you have understood that Elements can be part of different ElementTrees without loosing their global position in a document (which is basically how Fredrik's ElementTree works), it's even relatively obvious that there may be ancestors that are not part of the current ElementTree that is being worked on. I would even say it should be enough to document that somewhere. People will understand it. Then the 'unexpected' behaviour would no longer be surprising. Also, you should not forget that it is really, really rare that XSLTs actually look for parents. Almost all stylesheets I have seen so far were based on descend matching. That's how XSLT templates work, you apply them to the children of nodes. And since the (fake) root node is really a root node of a complete document (except for the second level parent pointers), this should not pose any problems at all. It's even less a problem for RelaxNG, where AFAICT the algorithms dictates descend matching.
Additionally, how does the getparent() operation fare in this context? The parent of a node that's wrapped directly in an ElementTree might be something that's not even supposed to be in that tree, which is really unexpected. Or is there a reason this doesn't happen?
This is different, since it does not depend on libxml2, so there is no fake document involved. I actually think that this was Fredrik's reasoning behind *not* providing a getparent() function at all. However, in the context of lxml, getparent() is sufficiently useful to accept this behaviour.
I know, it might sound a little confusing at first, but I hope you got the idea.
Yes, I think so, thank you for the explanation. I now need to consider costs and gains of this approach compared to the one lxml is taking currently...
Sure, but I really think the gains are much higher than the explainable parent problem that we introduce. I think this would be much easier to understand for people that are new to lxml but know ElementTree, than for people that have used lxml for too long... Stefan
One more thing I wanted to stress. Stefan Behnel wrote:
So, once you have understood that Elements can be part of different ElementTrees without loosing their global position in a document (which is basically how Fredrik's ElementTree works), it's even relatively obvious that there may be ancestors that are not part of the current ElementTree that is being worked on.
I forgot to stress that we *cannot* remove the restriction of Elements being only in a single *document* at a time, which ET does *not* enforce at all. This is the price we pay for using libxml and the reason why getparent() makes sense in lxml and not in ET. We can, however, remove the API divergence for Elements being in multiple ElementTrees - at the cost of making getparent() ignorant to ElementTrees. If compatibility with ET is the goal, I consider this the right trade-off. Stefan
Stefan Behnel wrote:
One more thing I wanted to stress.
Stefan Behnel wrote:
So, once you have understood that Elements can be part of different ElementTrees without loosing their global position in a document (which is basically how Fredrik's ElementTree works), it's even relatively obvious that there may be ancestors that are not part of the current ElementTree that is being worked on.
I forgot to stress that we *cannot* remove the restriction of Elements being only in a single *document* at a time, which ET does *not* enforce at all. This is the price we pay for using libxml and the reason why getparent() makes sense in lxml and not in ET.
Right.
We can, however, remove the API divergence for Elements being in multiple ElementTrees - at the cost of making getparent() ignorant to ElementTrees.
If compatibility with ET is the goal, I consider this the right trade-off.
I'm not debating anything you said, just trying to think through the implications. The risk exists in the current setup that an algorithm is used somewhere inside libxml2 that takes the parent, and that we break the expectations of this with the current approach. I think it should be possible to create a _fakeRootDoc variety that actually changes the parent pointer temporarily as well, returning the original parent so we can put it back when we destroy the root doc. As these two are actually part of an atomic operation I think this should work, correct? That would at least eliminate concerns about possible algorithms somewhere in libxml2 using the parent pointer. The getparent() issue would still remain. I want to think this through for a bit too. Regards, Martijn
Martijn Faassen wrote:
The risk exists in the current setup that an algorithm is used somewhere inside libxml2 that takes the parent, and that we break the expectations of this with the current approach.
I think it should be possible to create a _fakeRootDoc variety that actually changes the parent pointer temporarily as well, returning the original parent so we can put it back when we destroy the root doc. As these two are actually part of an atomic operation I think this should work, correct? That would at least eliminate concerns about possible algorithms somewhere in libxml2 using the parent pointer.
The _fakeRootDoc/_destroyFakeDoc pair is atomic in the sense that they wrap calls to libxml2/libxslt within a single API call. AFAICT, all actions that I wrapped with _fakeRootDoc do not modify the original document, so this is about pure read access. So, as long as they do not try to access the parent of any of the children of the new root node (that is the only problematic part), everything works just fine and completely as one would expect. If they do, however, they will *read* the parts of the original document that are not part of the ElementTree. This should not pose any fatal problem either, although this may result in unexpected behaviour if you assumed an ElementTree to build an independent document. If we do not give people this idea, even that case should work as expected. That said, since the calls actually are atomic, it should be possible to store the original parent link somewhere, replace the parent links and restore them afterwards. You could even leave the node where it is and store *its* sibling links and parent link, and then NULL them in place during the operation. Or copy the node and then copy it back afterwards. You wouldn't have to provide a variety of functions, though, the modifications should work everywhere. Which approach would you prefer? Stefan
Stefan Behnel wrote:
Martijn Faassen wrote:
The risk exists in the current setup that an algorithm is used somewhere inside libxml2 that takes the parent, and that we break the expectations of this with the current approach.
I think it should be possible to create a _fakeRootDoc variety that actually changes the parent pointer temporarily as well, returning the original parent so we can put it back when we destroy the root doc. As these two are actually part of an atomic operation I think this should work, correct? That would at least eliminate concerns about possible algorithms somewhere in libxml2 using the parent pointer.
The _fakeRootDoc/_destroyFakeDoc pair is atomic in the sense that they wrap calls to libxml2/libxslt within a single API call.
AFAICT, all actions that I wrapped with _fakeRootDoc do not modify the original document, so this is about pure read access. So, as long as they do not try to access the parent of any of the children of the new root node (that is the only problematic part), everything works just fine and completely as one would expect. If they do, however, they will *read* the parts of the original document that are not part of the ElementTree. This should not pose any fatal problem either, although this may result in unexpected behaviour if you assumed an ElementTree to build an independent document. If we do not give people this idea, even that case should work as expected.
True. I guess if we look at ElementTree, the philosophy exists that (parts of) a tree can be shared between elementtree. This will be visible when they're mutated. The problem with lxml is that lxml somewhere *does* have a notion of a document. Nodes can move around and disappear from one place, appear in another, can move between documents, etc. How much do we want to hide this notion from the user? We cannot fully do it, as it'll pop up in some places.
That said, since the calls actually are atomic, it should be possible to store the original parent link somewhere, replace the parent links and restore them afterwards. You could even leave the node where it is and store *its* sibling links and parent link, and then NULL them in place during the operation. Or copy the node and then copy it back afterwards.
You wouldn't have to provide a variety of functions, though, the modifications should work everywhere. Which approach would you prefer?
I'm in favor to have the parent link retained and restored in a local variable always. This because this is least likely to lead to problems with libxml2, which we'd otherwise be handing inconsistent trees. This may work *now* for what we've tested, but it may not work always. I still need to think about the implications of hiding the concept of the libxml2 document from users though. I hope my thoughts about this rather fundamental change don't frustrate you. I also feel I need to work this through before we adopt the other patches, as it changes so much. To add to the frustration, I'll be away on another trip the first half of next week and won't be able to work on lxml. I will continue to think this issue over though. Regards, Martijn
Martijn Faassen wrote:
I'm trying to understand why in scoder2 there's the use of _fakeRootDoc in a few places. I imagine this is to work around some bugs that show up in the trunk, but I don't have enough context on what and why. Stefan, can you enlighten me? Any tests that expose this behavior in particular?
I initially described the problem here: http://codespeak.net/pipermail/lxml-dev/2005-October/000501.html You may also follow my way towards the implementation in that thread. The original implementation is in SVN revision 18907, I copied the test case here, as added in revision 18906. You may notice that these test cases apply to etree and ElementTree, but only etree fails. I think I added further test cases along the road, so you may find other places in the test files where it says "*_multiple_*". The basic problem is always the same, though. Stefan Index: src/lxml/tests/test_etree.py =================================================================== --- src/lxml/tests/test_etree.py (Revision 18905) +++ src/lxml/tests/test_etree.py (Revision 18906) @@ -1544,7 +1544,25 @@ e = etree.Element('foo') e.set('bar', 'Bar') self.assertEquals(False, bool(e)) - + + def test_multiple_elementrees(self): + etree = self.etree + + a = etree.Element('a') + b = etree.SubElement(a, 'b') + + t = etree.ElementTree(a) + self.assertEquals(self._rootstring(t), '<a><b/></a>') + + t1 = etree.ElementTree(a) + self.assertEquals(self._rootstring(t1), '<a><b/></a>') + self.assertEquals(self._rootstring(t), '<a><b/></a>') + + t2 = etree.ElementTree(b) + self.assertEquals(self._rootstring(t2), '<b/>') + self.assertEquals(self._rootstring(t1), '<a><b/></a>') + self.assertEquals(self._rootstring(t), '<a><b/></a>') + def _writeElement(self, element, encoding='us-ascii'): """Write out element for comparison. """ @@ -2142,6 +2160,42 @@ '<doc><foo>Bar</foo><foo>Baz</foo></doc>', etree.tostring(result.getroot())) + def test_multiple_elementrees(self): + tree = self.parse('<a><b>B</b><c>C</c></a>') + style = self.parse('''\ +<xsl:stylesheet version="1.0" + xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> + <xsl:template match="a"><A><xsl:apply-templates/></A></xsl:template> + <xsl:template match="b"><B><xsl:apply-templates/></B></xsl:template> + <xsl:template match="c"><C><xsl:apply-templates/></C></xsl:template> +</xsl:stylesheet>''') + + self.assertEquals(self._rootstring(tree), + '<a><b>B</b><c>C</c></a>') + result = tree.xslt(style) + self.assertEquals(self._rootstring(tree), + '<a><b>B</b><c>C</c></a>') + self.assertEquals(self._rootstring(result), + '<A><B>B</B><C>C</C></A>') + + b_tree = etree.ElementTree(tree.getroot()[0]) + self.assertEquals(self._rootstring(b_tree), + '<b>B</b>') + result = b_tree.xslt(style) + self.assertEquals(self._rootstring(tree), + '<a><b>B</b><c>C</c></a>') + self.assertEquals(self._rootstring(result), + '<B>B</B>') + + c_tree = etree.ElementTree(tree.getroot()[1]) + self.assertEquals(self._rootstring(c_tree), + '<c>C</c>') + result = c_tree.xslt(style) + self.assertEquals(self._rootstring(tree), + '<a><b>B</b><c>C</c></a>') + self.assertEquals(self._rootstring(result), + '<C>C</C>') + class ETreeRelaxNGTestCase(HelperTestCase): def test_relaxng(self): tree_valid = self.parse('<a><b></b></a>') @@ -2191,6 +2245,35 @@ self.assert_(tree_valid.relaxng(schema)) self.assert_(not tree_invalid.relaxng(schema)) + def test_multiple_elementrees(self): + tree = self.parse('<a><b>B</b><c>C</c></a>') + schema = etree.RelaxNG( self.parse('''\ +<element name="a" xmlns="http://relaxng.org/ns/structure/1.0"> + <element name="b"> + <text /> + </element> + <element name="c"> + <text /> + </element> +</element> +''') ) + self.assert_(schema.validate(tree)) + self.assert_(schema.validate(tree)) + + schema = etree.RelaxNG( self.parse('''\ +<element name="b" xmlns="http://relaxng.org/ns/structure/1.0"> + <text /> +</element> +''') ) + c_tree = etree.ElementTree(tree.getroot()[1]) + self.assertEqual(self._rootstring(c_tree), '<c>C</c>') + self.assert_(not schema.validate(c_tree)) + + b_tree = etree.ElementTree(tree.getroot()[0]) + self.assertEqual(self._rootstring(b_tree), '<b>B</b>') + self.assert_(schema.validate(b_tree)) + + class ETreeXMLSchemaTestCase(HelperTestCase): def test_xmlschema(self): tree_valid = self.parse('<a><b></b></a>')
participants (2)
-
Martijn Faassen
-
Stefan Behnel