[lxml-dev] Problem with using the same URI twice in a namespace
I assume it is legal to have the following namespace declaration/usage:

    <top xmlns="a" xmlns:a="a" xmlns:b="b">
      <foo bar=""/>
      <b:foobar a:bar=""/>
    </top>

It works when I read such a definition with lxml.etree.parse, but I can't construct it with lxml.etree.Element, because the nsmap dict is then normalized in such a way that each URI occurs only once. Is this a bug in lxml, or shouldn't it be used in this way?

cheers
Andreas
Hi, Andreas Degert wrote:
I assume it is legal to have the following namespace declaration/usage:

    <top xmlns="a" xmlns:a="a" xmlns:b="b">
      <foo bar=""/>
      <b:foobar a:bar=""/>
    </top>
Sure, the spec calls this well-formed XML - not talking aesthetics, though.
It works when I read such a definition with lxml.etree.parse, but I can't construct it with lxml.etree.Element because then the nsmap dict will be normalized in such a way that each URI occurs only once.
Finally someone complaining that there are too *few* namespace declarations instead of too many. ;o)

lxml does a lot of work behind the scenes to keep namespaces consistent and simple throughout whatever operation you perform at the API level. In the case you describe, lxml checks on each new namespace prefix declaration whether that namespace is already defined in the tree context of the Element, and reuses the old prefix if that is the case. The function that does that is _initNodeNamespaces() in apihelpers.pxi, in case you're interested.
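The prefix reuse described above can be sketched in plain Python. This is only an illustration of the described behaviour, not the code in _initNodeNamespaces(): the helper name normalize_nsmap and the "first declaration wins" rule are assumptions made for the example.

```python
def normalize_nsmap(requested, in_scope=None):
    """Hypothetical helper: keep at most one prefix per namespace URI.

    `in_scope` holds declarations inherited from ancestor elements and
    `requested` is the nsmap passed for the new element. A requested
    prefix is dropped when its URI is already reachable through a prefix
    seen earlier (assumption: the first declaration wins).
    """
    known_uris = set((in_scope or {}).values())
    declared = {}
    for prefix, uri in requested.items():
        if uri in known_uris:
            continue  # reuse the existing prefix, add no new declaration
        declared[prefix] = uri
        known_uris.add(uri)
    return declared


result = normalize_nsmap({None: "a", "a": "a", "b": "b"})
print(result)  # {None: 'a', 'b': 'b'}
```

Under this sketch the "a" prefix is dropped because the default namespace already covers URI "a", so the URI ends up reachable only through the empty prefix.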
Is this a bug in lxml or shouldn't it be used in this way?
I don't see the use case. What could you do with redundant namespace prefix declarations that you can't do with a single one? Imagine you have two prefixes defined for a namespace and you add a subelement with that namespace. Which prefix should be used? What purpose does that ambiguity serve?

Stefan
On Tue, 22 Apr 2008 22:42:12 +0200 Stefan Behnel <stefan_ml@behnel.de> wrote:
Hi,
Andreas Degert wrote:
I assume it is legal to have the following namespace declaration/usage:

    <top xmlns="a" xmlns:a="a" xmlns:b="b">
      <foo bar=""/>
      <b:foobar a:bar=""/>
    </top>
Sure, the spec calls this well-formed XML - not talking aesthetics, though.
It works when I read such a definition with lxml.etree.parse, but I can't construct it with lxml.etree.Element because then the nsmap dict will be normalized in such a way that each URI occurs only once.
Finally someone complaining that there are too *few* namespace declarations instead of too many. ;o)
lxml does a lot of work behind the scenes to keep namespaces consistent and simple throughout whatever operation you perform at the API level. In the case you describe, lxml checks on each new namespace prefix declaration whether that namespace is already defined in the tree context of the Element, and reuses the old prefix if that is the case. The function that does that is _initNodeNamespaces() in apihelpers.pxi, in case you're interested.
Is this a bug in lxml or shouldn't it be used in this way?
I don't see the use case. What could you do with redundant namespace prefix declarations that you can't do with a single one?
I think the behaviour leads to a bug:

    t = Element("top", nsmap={None: "a", "b": "b"})
    SubElement(t, "{b}foobar", {"{a}bar": ""})
    print tostring(t, pretty_print=True)
    -----
    <top xmlns="a" xmlns:b="b">
      <b:foobar bar=""/>
    </top>
    -----

In the output the attribute bar should have namespace a, but it has no namespace (the default namespace doesn't apply to attributes, as specified in http://www.w3.org/TR/REC-xml-names/#scoping-defaulting, section 6.2).

hmmm... even simpler example:

    Element("top", {"bar": "", "{a}bar": ""}, nsmap={None: "a", "b": "b"})

yields

    <top xmlns="a" xmlns:b="b" bar="" bar=""/>
Imagine you have two prefixes defined for a namespace and you add a subelement with that namespace. Which prefix should be used? What purpose does that ambiguity serve?
The default namespace is a special case because it doesn't apply to attributes (this means when attributes have a namespace value they must be serialized with a prefix). When serializing elements the default namespace should have a higher priority, i.e. those elements can be written without prefix.
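The scoping rule can be checked by parsing the document from the start of this thread and looking at the resulting names in {uri}local notation. This is a small sketch; the fallback to the standard library's xml.etree.ElementTree is an addition for environments without lxml and uses the same calls.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls used here

DOC = (
    '<top xmlns="a" xmlns:a="a" xmlns:b="b">'
    '<foo bar=""/>'
    '<b:foobar a:bar=""/>'
    '</top>'
)

root = etree.fromstring(DOC)
foo, foobar = root[0], root[1]

# Unprefixed elements pick up the default namespace ...
print(root.tag, foo.tag)    # {a}top {a}foo
# ... but an unprefixed attribute stays in no namespace at all.
print(dict(foo.attrib))     # {'bar': ''}
# Only the prefixed attribute lands in namespace "a".
print(foobar.tag)           # {b}foobar
print(dict(foobar.attrib))  # {'{a}bar': ''}
```

So bar and a:bar are two different attributes, and a namespaced attribute can only be written out correctly if some non-empty prefix is bound to its namespace.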
Stefan
Andreas Degert wrote:
In the output the attribute bar should have namespace a, but it has no namespace (the default namespace doesn't apply to attributes as specified in http://www.w3.org/TR/REC-xml-names/#scoping-defaulting, section 6.2).
Element("top", {"bar":"", "{a}bar":""}, nsmap={None:"a","b":"b"})
yields <top xmlns="a" xmlns:b="b" bar="" bar=""/>
The default namespace is a special case because it doesn't apply to attributes (this means when attributes have a namespace value they must be serialized with a prefix).
I see the problem. Actually, now that you mention it, it is not uncommon to define multiple prefixes for a namespace, e.g. in XSD or WSDL.

Maybe we can somehow prioritise namespace declarations on the way in, or special-case the default namespace in the cleanup procedure (like: making sure it comes last in the declaration list, although that wouldn't impact the parser). It would be nice to have some simple rules for checking when this has to be done, as it definitely adds overhead.

I could even accept not simplifying the nsmap at all, but there still is the problem of namespace cleanup when moving elements (moveNodeToDocument() in proxy.pxi). We would need special rules there, too, like: allow adding a second prefix for the default namespace - no idea if that case is easy to recognise and handle.
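The "default namespace comes last" idea can be sketched as follows. Both helper names are hypothetical, and the first-match lookup is an assumption about how a consumer of the declaration list might pick a prefix, not a statement about libxml2 internals.

```python
def declaration_order(nsmap):
    """Hypothetical helper: (prefix, uri) pairs with the default namespace last."""
    prefixed = [(p, u) for p, u in nsmap.items() if p is not None]
    default = [(p, u) for p, u in nsmap.items() if p is None]
    return prefixed + default


def first_prefix_for(uri, declarations):
    """Return the first declared prefix bound to `uri` (assumed first-match lookup)."""
    for prefix, declared_uri in declarations:
        if declared_uri == uri:
            return prefix
    raise KeyError(uri)


nsmap = {None: "a", "a": "a", "b": "b"}
unordered = list(nsmap.items())
ordered = declaration_order(nsmap)

print(first_prefix_for("a", unordered))  # None: only the empty prefix is found
print(first_prefix_for("a", ordered))    # a: usable for a namespaced attribute
```

The reordering costs one pass over the nsmap and only matters when the default namespace is declared redundantly on the same element.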
When serializing elements the default namespace should have a higher priority, i.e. those elements can be written without prefix.
The serialiser is part of libxml2. If you want changes in this part of lxml, ask on the libxml2 mailing list. However, I think the more general problem is in lxml here.

Stefan
Andreas Degert wrote:
I think the behaviour leads to a bug:
    t = Element("top", nsmap={None: "a", "b": "b"})
    SubElement(t, "{b}foobar", {"{a}bar": ""})
    print tostring(t, pretty_print=True)
    -----
    <top xmlns="a" xmlns:b="b">
      <b:foobar bar=""/>
    </top>
    -----
This is definitely a problem in the serialiser of libxml2:
    >>> t = Element("top", nsmap={None: "a", "b": "b", 'a': 'a'})
    >>> SubElement(t, "{b}foobar", {"{a}bar": ""})
    <Element {b}foobar at b798dd9c>
    >>> print tostring(t, pretty_print=True)
    <top xmlns="a" xmlns:a="a" xmlns:b="b">
      <b:foobar bar=""/>
    </top>
It would have to prefer the prefixed namespace instead of the default one to get this right. But this does not come for free, imagine this case:

    <top xmlns:a="a" xmlns:b="b">
      <test xmlns="a">
        <b:foobar bar=""/>
      </test>
    </top>

So it would always have to check the entire root path if the attribute target namespace is defined with an empty prefix, and the current element has a different namespace.

Stefan
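The root-path check mentioned above could look roughly like this. It is a sketch over plain dicts rather than libxml2 nodes; the function name and the representation of the ancestor chain as a list of nsmap dicts (innermost element first) are assumptions made for the example.

```python
def prefix_for_attribute(uri, scopes):
    """Find a non-empty prefix bound to `uri` that is still visible.

    `scopes` lists the namespace declarations of the current element and
    its ancestors, innermost first. A prefix redeclared closer to the
    element hides (shadows) the same prefix further up the tree.
    """
    shadowed = set()
    for nsmap in scopes:
        for prefix, declared_uri in nsmap.items():
            if prefix in shadowed:
                continue  # hidden by a closer declaration
            if prefix is not None and declared_uri == uri:
                return prefix
        shadowed.update(nsmap.keys())
    return None  # no usable prefix; a new declaration would be needed


# <top xmlns:a="a" xmlns:b="b"><test xmlns="a"><b:foobar .../></test></top>
scopes = [{}, {None: "a"}, {"a": "a", "b": "b"}]
print(prefix_for_attribute("a", scopes))  # a

# Here the inner element rebinds prefix "a", so the outer binding is unusable.
print(prefix_for_attribute("a", [{"a": "other"}, {"a": "a"}]))  # None
```

The walk is linear in the depth of the element, which is the extra cost pointed out above for every attribute whose namespace is also bound to the empty prefix.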