[lxml-dev] Request for help on testing new libxml2 feature
Hi,
regarding the remove-redundant-namespaces issue, there is news: kbuchcik implemented xmlDOMWrapReconcileNamespaces in tree.c of libxml2, so it should remove redundant NS declarations. However, I have neither any experience with libxml2 nor the time to dig into it far enough to build a test program for this.
So I'm asking you guys here, who surely are libxml2 experts, whether you could help me out. Either some "hack" for lxml that allows me to test this, or a small program that takes an XML file, applies this function to its DOM tree and outputs the result, would be really great.
Thanks,
Andreas
--
You now have Asian Flu.
Hi, On Wed, 2006-02-01 at 20:55 +0100, Andreas Pakulat wrote:
Hi,
regarding the remove-redundant-namespaces issue there are news:
kbuchcik implemented the xmlDOMWrapReconcileNamespaces in tree.c of libxml2 so it should remove redundant NS decl. However neither do I have any experience with libxml2 nor do I have the time to dig into it so that I can build a test program for this.
Thus I ask you guys here, who surely are libxml2 experts, if you could help me out here. Either some "hack" for lxml that allows me to test this or a small programm that takes an xml file and applies this function to it's DOM tree (and outputs the result) would be really great.
Note that removal of redundant ns-decls in xmlDOMWrapReconcileNamespaces() was committed to CVS just yesterday, and I fixed some bugs today. I performed only rudimentary tests, so more testing would be appreciated.
Andreas' bug entry: http://bugzilla.gnome.org/show_bug.cgi?id=329347
Regards,
Kasimier
Kasimier Buchcik wrote:
Note that removal of redundant ns-decls in xmlDOMWrapReconcileNamespaces() was committed to CVS just yesterday and I fixed some bugs today; I peformed only rudimental tests, so more testing would be appreciated.
Andreas' bug-entry: http://bugzilla.gnome.org/show_bug.cgi?id=329347
Ok, I think we cannot easily depend on CVS versions of libxml2 in lxml, so this rather experimental feature will not be supported in lxml for a while.
To work around this kind of problem with libxml2 versions, we *could* support something like conditional *compilation* in lxml, depending on the libxml2 version. It's not beautiful, but I could imagine something like this, related to the new I/O API and Geert's/Patrick's patches:
    class XMLFormatter:
        def __init__(..., xhtml=False, ...):
            if xhtml:
                if LIBXML_VERSION < 20623:
                    raise VersionError, \
                        "Libxml version >= 2.6.23 needed for XHTML formatting"
                else:
                    ...
This would allow us to keep a consistent API while supporting different versions of libxml2 and their diverging features. It obviously only works for libxml2 versions that provide all API functions used by the lxml code, since they must be available at compile time. I haven't actually tested this, but it may even work.
Anyway, for now, this is only an idea. So far, we do not even have code that needs this. But it may make the change to Geert's output features simpler and may also allow us to support experimental features of libxml2.
Stefan
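The version-gating idea above can be sketched in plain Python. This is only an illustration under stated assumptions, not lxml code: `LIBXML_VERSION` here is a hard-coded stand-in for the integer that libxml2 exposes (major * 10000 + minor * 100 + micro), and `require_libxml` is a hypothetical helper introduced for this example. It uses current Python 3 syntax rather than the Python 2 style of the original sketch.

```python
class VersionError(Exception):
    """Raised when the available libxml2 is too old for a requested feature."""


# Assumption: stand-in for the libxml2 version the extension was built against,
# encoded the way libxml2 does it (2.6.22 -> 20622).
LIBXML_VERSION = 20622


def require_libxml(minimum, feature):
    """Raise VersionError unless LIBXML_VERSION is at least `minimum`."""
    if LIBXML_VERSION < minimum:
        major, rest = divmod(minimum, 10000)
        minor, micro = divmod(rest, 100)
        raise VersionError(
            f"libxml2 >= {major}.{minor}.{micro} needed for {feature}"
        )


class XMLFormatter:
    def __init__(self, xhtml=False):
        # The API stays the same for every libxml2 version; only the
        # optional feature is refused when the library is too old.
        if xhtml:
            require_libxml(20623, "XHTML formatting")
        self.xhtml = xhtml
```

Keeping the check in one helper means each version-dependent feature states its minimum version in a single line, and callers get the same exception type regardless of which feature is missing.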
Andreas Pakulat wrote:
regarding the remove-redundant-namespaces issue there are news:
kbuchcik implemented the xmlDOMWrapReconcileNamespaces in tree.c of libxml2 so it should remove redundant NS decl. However neither do I have any experience with libxml2 nor do I have the time to dig into it so that I can build a test program for this.
Thus I ask you guys here, who surely are libxml2 experts, if you could help me out here. Either some "hack" for lxml that allows me to test this or a small programm that takes an xml file and applies this function to it's DOM tree (and outputs the result) would be really great.
Hi,
Here is a trivial patch that simply calls the function after having copied an element between documents. I think this shouldn't do any harm since the new CVS options will be ignored by older libxml2 versions.
Could you please apply it to the current lxml SVN version and test it against the libxml2 CVS version to see if it helps with the redundant namespace problems?
Stefan
Index: src/lxml/etree.pyx
===================================================================
--- src/lxml/etree.pyx (Revision 23981)
+++ src/lxml/etree.pyx (Arbeitskopie)
@@ -1369,7 +1369,8 @@
     """
     changeDocumentBelowHelper(node._c_node, doc)
     tree.xmlReconciliateNs(doc._c_doc, node._c_node)
-
+    tree.xmlDOMWrapReconcileNamespaces(NULL, node._c_node, 1)
+
 cdef void changeDocumentBelowHelper(xmlNode* c_node, _Document doc):
     cdef ProxyRef* ref
     cdef xmlNode* c_current
Index: src/lxml/tree.pxd
===================================================================
--- src/lxml/tree.pxd (Revision 23981)
+++ src/lxml/tree.pxd (Arbeitskopie)
@@ -154,6 +154,8 @@
     cdef xmlDoc* xmlCopyDoc(xmlDoc* doc, int recursive)
     cdef xmlNode* xmlCopyNode(xmlNode* node, int extended)
     cdef int xmlReconciliateNs(xmlDoc* doc, xmlNode* tree)
+    cdef int xmlDOMWrapReconcileNamespaces(void* ctxt, xmlNode* tree,
+                                           int options)
 
     cdef xmlBuffer* xmlBufferCreate()
     cdef char* xmlBufferContent(xmlBuffer* buf)
On 05.03.06 13:00:49, Stefan Behnel wrote:
Hi,
Here is a trivial patch that simply calls the function after having copied an element between documents.
If I understand that correctly, it should also work if I create a new Element (with a namespace) and insert it as a child, right? If that is correct, then this doesn't help. I still get an extra ns declaration:
>>> print etree.tostring(tree)
<elem xmlns="test"><subelem/></elem>
[25296 refs]
>>> tree.append(etree.Element("{test}sub1"))
[25296 refs]
>>> print etree.tostring(tree)
<elem xmlns="test"><subelem/><ns0:sub1 xmlns:ns0="test"/></elem>
[25296 refs]
>>> tree.append(etree.Element("{test}sub2"))
[25296 refs]
>>> print etree.tostring(tree)
<elem xmlns="test"><subelem/><ns0:sub1 xmlns:ns0="test"/><ns0:sub2 xmlns:ns0="test"/></elem>
BTW: Stefan, the setup.py was correct in using xslt-config to get the compile parameters. xslt-config is of course part of libxslt, which I had forgotten at first (and that's also why I got the missing function when importing).
Andreas
--
You will be misunderstood by everyone.
Andreas Pakulat wrote:
On 05.03.06 13:00:49, Stefan Behnel wrote:
Here is a trivial patch that simply calls the function after having copied an element between documents.
If I understand that correctly it should also work if I create a new Element (with a namespace) and insert it as child right?
If that is correct, than this doesn't help. I still get a an extra ns declaration:
>>> print etree.tostring(tree)
<elem xmlns="test"><subelem/></elem>
[25296 refs]
>>> tree.append(etree.Element("{test}sub1"))
[25296 refs]
>>> print etree.tostring(tree)
<elem xmlns="test"><subelem/><ns0:sub1 xmlns:ns0="test"/></elem>
[25296 refs]
>>> tree.append(etree.Element("{test}sub2"))
[25296 refs]
>>> print etree.tostring(tree)
<elem xmlns="test"><subelem/><ns0:sub1 xmlns:ns0="test"/><ns0:sub2 xmlns:ns0="test"/></elem>
If I'm not mistaken, this is the expected behaviour of my patch. The problem is that it only fixes the subtree of the element itself, not the entire tree. If you added the tree itself to a new tree, it should fix the duplication of namespaces that you currently see.
I know this is not quite what was intended, but could you check that this happens? That would tell us that the libxml2 function works so far. We'd then have to find a way to call it in the right place...
Stefan
On 05.03.06 21:46:33, Stefan Behnel wrote:
If I'm not mistaken, this is the expected behaviour from my patch. The problem is that it only fixes the tree of the element itself, not the entire tree. If you added the tree itself to a new tree, it should fix the current douplication of namespaces that you saw.
So the following should not happen, if I understand you correctly?
>>> from lxml.etree import *
[25180 refs]
>>> doc = fromstring("<elem xmlns=\"test\" />")
[25226 refs]
>>> doc.append("{test}sub")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 397, in etree._Element.append
TypeError: Argument 'element' has incorrect type (expected etree._Element, got str)
[25278 refs]
>>> doc.append(Element("{test}sub"))
[25278 refs]
>>> tostring(doc)
'<elem xmlns="test"><ns0:sub xmlns:ns0="test"/></elem>'
[25280 refs]
>>> doc2 = fromstring("<main />")
[25284 refs]
>>> doc2.append(doc)
[25284 refs]
>>> tostring(doc2)
'<main><elem xmlns="test"><sub xmlns:ns0="test"/></elem></main>'
[25284 refs]
However this works:
>>> doc = fromstring("<ns0:elem xmlns:ns0=\"test\" />")
[25285 refs]
>>> doc.append(Element("{test}sub"))
[25285 refs]
>>> tostring(doc)
'<ns0:elem xmlns:ns0="test"><ns0:sub/></ns0:elem>'
[25285 refs]
So at least something works (my system lxml doesn't show this behaviour). However I think this is the normal ns-cleanup working and it doesn't fix the bug I reported with libxml...
Andreas
--
You have an unusual equipment for success. Be sure to use it properly.
Andreas Pakulat wrote:
So the following should not happen, if I understand you correctly?
>>> from lxml.etree import *
[25180 refs]
>>> doc = fromstring("<elem xmlns=\"test\" />")
[25226 refs]
>>> doc.append("{test}sub")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 397, in etree._Element.append
TypeError: Argument 'element' has incorrect type (expected etree._Element, got str)
[25278 refs]
>>> doc.append(Element("{test}sub"))
[25278 refs]
>>> tostring(doc)
'<elem xmlns="test"><ns0:sub xmlns:ns0="test"/></elem>'
[25280 refs]
>>> doc2 = fromstring("<main />")
[25284 refs]
>>> doc2.append(doc)
[25284 refs]
>>> tostring(doc2)
'<main><elem xmlns="test"><sub xmlns:ns0="test"/></elem></main>'
Correct, that should not happen, as my patch calls the DOMWrap function on "elem" in the second-to-last statement (the doc2.append(doc) call). So I would assume the modified libxml2 function doesn't solve the problem.
However this works:
>>> doc = fromstring("<ns0:elem xmlns:ns0=\"test\" />")
[25285 refs]
>>> doc.append(Element("{test}sub"))
[25285 refs]
>>> tostring(doc)
'<ns0:elem xmlns:ns0="test"><ns0:sub/></ns0:elem>'
[25285 refs]
So at least something works (my system lxml doesn't show this behaviour). However I think this is the normal ns-cleanup working and it doesn't fix the bug I reported with libxml...
Right, the Element() call should create an ns0 prefix, which is then merged with the existing one. So that should just work as before. Stefan
Hi, On Mon, 2006-03-06 at 07:51 +0100, Stefan Behnel wrote:
Andreas Pakulat wrote:
[...]
'<main><elem xmlns="test"><sub xmlns:ns0="test"/></elem></main>'
Correct, that should not happen, as my patch calls the DOMWrap function on "elem" at the before last line. So I would assume the modified libxml2 function doesn't solve the problem.
However this works:
>>> doc = fromstring("<ns0:elem xmlns:ns0=\"test\" />")
[25285 refs]
>>> doc.append(Element("{test}sub"))
[25285 refs]
>>> tostring(doc)
'<ns0:elem xmlns:ns0="test"><ns0:sub/></ns0:elem>'
[25285 refs]
So at least something works (my system lxml doesn't show this behaviour). However I think this is the normal ns-cleanup working and it doesn't fix the bug I reported with libxml...
Right, the Element() call should create an ns0 prefix, which is then merged with the existing one. So that should just work as before.
The function xmlDOMWrapReconcileNamespaces() does not try to eliminate namespace declarations that use different namespace prefixes. This is due to QNames in attribute/element content. QNames need a corresponding ns-prefix to be in scope; thus Libxml2 tries to avoid automatic renaming of prefixes.
Example:
<x:foo xmlns:x="urn:test:FOO">
  <y:bar xmlns:y="urn:test:FOO">y:myQNameValue</y:bar>
</x:foo>
An elimination of the ns-decl with the "y" prefix would break the QName.
So if lxml does somehow create distinct ns-prefixes (I'm not familiar with lxml's mechanism here), then the current elimination mechanism won't be useful.
Regards,
Kasimier
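The QName problem described above can be reproduced with the Python standard library. The following is a minimal sketch, not anything libxml2 or lxml does: `resolve_text_qnames` is a hypothetical helper that treats any text of the form `prefix:local` as a QName (a guess, since nothing in the document says whether it really is one) and looks the prefix up among the namespace declarations in scope. `COMPACTED` is a hand-written version of the example with the "y" declaration removed.

```python
import io
import xml.etree.ElementTree as ET


def resolve_text_qnames(xml_text):
    """Map element tags to (namespace URI, local name) for QName-like text."""
    in_scope = []  # stack of (prefix, uri) pairs, innermost declaration last
    resolved = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end-ns", "end")):
        if event == "start-ns":
            in_scope.append(payload)
        elif event == "end-ns":
            in_scope.pop()
        else:
            # "end" is reported before the element's own declarations go
            # out of scope, so they are still on the stack here.
            text = (payload.text or "").strip()
            if ":" in text:
                prefix, local = text.split(":", 1)
                resolved[payload.tag] = (dict(in_scope).get(prefix), local)
    return resolved


ORIGINAL = (
    '<x:foo xmlns:x="urn:test:FOO">'
    '<y:bar xmlns:y="urn:test:FOO">y:myQNameValue</y:bar>'
    "</x:foo>"
)
# Same document after a hypothetical clean-up that merged "y" into "x".
COMPACTED = (
    '<x:foo xmlns:x="urn:test:FOO">'
    "<x:bar>y:myQNameValue</x:bar>"
    "</x:foo>"
)

print(resolve_text_qnames(ORIGINAL))
print(resolve_text_qnames(COMPACTED))
```

In the original document the "y" prefix of the text value resolves to urn:test:FOO; in the compacted one the lookup yields no namespace at all, even though both element trees are otherwise equivalent.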
On 06.03.06 14:06:37, Kasimier Buchcik wrote:
The function xmlDOMWrapReconcileNamespaces() does not try to eliminate namespace declarations for different namespace prefixes.
But that's exactly what the libxml bugreport is about.
This is due to QNames in attribute/element content. QNames need a corresponding ns-prefix to be in scope; thus Libxml2 tries to avoid automatic renaming of prefixes.
Now I'm not too familiar with the specs, but does ":" in element content need escaping? If not, then how can you distinguish a string content containing ":" at some point from a QName as element content, if you don't have an XML Schema at hand that tells you?
Example:
<x:foo xmlns:x="urn:test:FOO">
  <y:bar xmlns:y="urn:test:FOO">y:myQNameValue</y:bar>
</x:foo>
For my personal use-case it would be sufficient if the bar element could take the prefix from foo and you leave the extra ns-decl in it so the QName is still in scope.
An elimination of the ns-decl with the "y" prefix would break the QName.
You could change its prefix too; however, you would probably need to do that on the whole subtree of bar, right?
Andreas
--
You will overcome the attacks of jealous associates.
Hi, On Mon, 2006-03-06 at 14:44 +0100, Andreas Pakulat wrote:
On 06.03.06 14:06:37, Kasimier Buchcik wrote:
The function xmlDOMWrapReconcileNamespaces() does not try to eliminate namespace declarations for different namespace prefixes.
But that's exactly what the libxml bugreport is about.
Then I'm not eager to implement this. But maybe someone else will enhance the function to do what you want.
This is due to QNames in attribute/element content. QNames need a corresponding ns-prefix to be in scope; thus Libxml2 tries to avoid automatic renaming of prefixes.
Now I'm not too familiar with the specs, but does ":" in element content need escaping? If not, then how can you distinguish a string content containing ":" at some point from a QName as element content, if you don't have an XML Schema at hand that tells you?
This is exactly the problem: the tree modification functions do not know where you intended to use QNames. So currently the only robust way to keep the correct prefix for a QName in scope is to keep mechanisms that are ignorant of QNames in text content from modifying the prefixes of ns-declarations.
Example:
<x:foo xmlns:x="urn:test:FOO">
  <y:bar xmlns:y="urn:test:FOO">y:myQNameValue</y:bar>
</x:foo>
For my personal use-case it would be sufficient if the bar element could take the prefix from foo and you leave the extra ns-decl in it so the QName is still in scope.
Hmm, what you describe here is not an elimination of redundant ns-declarations.
An elimination of the ns-decl with the "y" prefix would break the QName.
You could change it's prefix too, however you probably need to do that on the whole subtree of bar right?
As you already correctly observed, we cannot change the prefix of the QName, since the Libxml2 functions do not know where you intended to use QNames and where not. Regards, Kasimier
On 06.03.06 15:54:01, Kasimier Buchcik wrote:
On Mon, 2006-03-06 at 14:44 +0100, Andreas Pakulat wrote:
On 06.03.06 14:06:37, Kasimier Buchcik wrote:
The function xmlDOMWrapReconcileNamespaces() does not try to eliminate namespace declarations for different namespace prefixes.
But that's exactly what the libxml bugreport is about.
Then I'm not eager to implement this. But maybe someone else will enhance the function to do what you want.
:-(
This is due to QNames in attribute/element content. QNames need a corresponding ns-prefix to be in scope; thus Libxml2 tries to avoid automatic renaming of prefixes.
Now I'm not too familiar with the specs, but does ":" in element content need escaping? If not, then how can you distinguish a string content containing ":" at some point from a QName as element content, if you don't have an XML Schema at hand that tells you?
This is exactly the problem: the tree modification functions do not know where you intended to use QNames, so currently the only robust way to keep the correct prefix for a QName in scope, is to avoid modifiying prefixes of ns-declarations by QName-in-text-content ignorant mechanisms.
I guess these modification functions cannot use an XML Schema document that is referenced by the XML document? If they could, you could say: all element content that is not an element itself is a string, which would be OK with the XML spec, AFAIK. This way you would either know (from the schema) that the content is a QName (or can contain one), or treat it as simple text.
Example:
<x:foo xmlns:x="urn:test:FOO">
  <y:bar xmlns:y="urn:test:FOO">y:myQNameValue</y:bar>
</x:foo>
For my personal use-case it would be sufficient if the bar element could take the prefix from foo and you leave the extra ns-decl in it so the QName is still in scope.
Hmm, what you describe here is not an elimination of redundant ns-declarations.
Right, as I said, this is just my use case, where a document like
<foo xmlns="myuri">
  <bar attribute1="blub" />
</foo>
is turned into something like the following if I insert new elements using the Element class rather than SubElement:
<foo xmlns="myuri">
  <bar attribute1="blub" />
  <ns0:but xmlns:ns0="myuri">content</ns0:but>
</foo>
And I'd like to avoid this extra namespace declaration. Also, I'm going to add new elements very often, so the XML document ends up readable only by machines, because it's cluttered with namespace declarations.
Andreas
--
You will pass away very quickly.
Andreas Pakulat wrote:
On 06.03.06 15:54:01, Kasimier Buchcik wrote:
This is due to QNames in attribute/element content. QNames need a corresponding ns-prefix to be in scope; thus Libxml2 tries to avoid automatic renaming of prefixes.
Now I'm not too familiar with the specs, but does ":" in element content need escaping? If not, then how can you distinguish a string content containing ":" at some point from a QName as element content, if you don't have an XML Schema at hand that tells you?
This is exactly the problem: the tree modification functions do not know where you intended to use QNames, so currently the only robust way to keep the correct prefix for a QName in scope, is to avoid modifiying prefixes of ns-declarations by QName-in-text-content ignorant mechanisms.
I guess these modification function cannot use a xml schema document that is references by the xml document? If they could, you could say: All element content that is not an element itself is a string, which would be OK with the XML spec, AFAIK. This way you would either know (from the schema) that the content is a QName (or can contain one) or treat it as simple text.
Frankly, I think QNames in element text / attribute values are such a rare edge case that they could be neglected; making the namespace-compacting stuff an option leaves them "safe", while still allowing the dominant case to clean up nicely.
Tres.
--
===================================================================
Tres Seaver +1 202-558-7113 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
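The trade-off discussed in this thread (reuse the ancestor's prefix, and drop the redundant declaration only on request) can be sketched with xml.dom.minidom. This is a toy pass written for illustration, not libxml2's algorithm and not part of lxml: `merge_prefixes` and its `drop_declarations` flag are hypothetical names, prefixed attributes are not rewritten, and text content is never inspected, so a QName in text can be left dangling exactly as described earlier.

```python
from xml.dom import minidom


def merge_prefixes(element, inherited=None, drop_declarations=False):
    """Make elements reuse an ancestor's prefix for the same namespace URI.

    `inherited` maps namespace URI -> prefix ("" is the default namespace)
    for declarations found on ancestors of `element`.
    """
    inherited = inherited or {}

    # Namespace declarations made directly on this element: prefix -> URI.
    declared = {}
    attrs = element.attributes
    for i in range(attrs.length):
        attr = attrs.item(i)
        if attr.name == "xmlns":
            declared[""] = attr.value
        elif attr.name.startswith("xmlns:"):
            declared[attr.name[len("xmlns:"):]] = attr.value

    # Ancestor bindings that this element does not rebind to another URI.
    usable = {uri: prefix for uri, prefix in inherited.items()
              if declared.get(prefix, uri) == uri}

    # Rename the element if an ancestor already binds its namespace.
    uri = element.namespaceURI
    target = usable.get(uri) if uri else None
    if target is not None and target != (element.prefix or ""):
        local = element.tagName.split(":", 1)[-1]
        element.prefix = target or None
        element.tagName = element.nodeName = f"{target}:{local}" if target else local

    # Optional and unsafe for QNames in text: drop declarations that
    # repeat a namespace an ancestor already declares.
    if drop_declarations:
        for prefix, decl_uri in declared.items():
            if decl_uri in usable:
                element.removeAttribute(f"xmlns:{prefix}" if prefix else "xmlns")

    # Bindings visible to children; ancestor prefixes take precedence.
    scope = dict(usable)
    for prefix, decl_uri in declared.items():
        scope.setdefault(decl_uri, prefix)

    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            merge_prefixes(child, scope, drop_declarations)


SOURCE = ('<foo xmlns="myuri"><bar attribute1="blub"/>'
          '<ns0:but xmlns:ns0="myuri">content</ns0:but></foo>')

doc = minidom.parseString(SOURCE)
merge_prefixes(doc.documentElement, drop_declarations=True)
print(doc.documentElement.toxml())
```

With the flag left at its default, the `but` element is renamed but keeps its `xmlns:ns0` declaration, which is the compromise Andreas asked for; with the flag set, the declaration disappears and any `ns0:` QName in text would no longer resolve.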
participants (4)
- Andreas Pakulat
- Kasimier Buchcik
- Stefan Behnel
- Tres Seaver