Mailman 3 [lxml-dev] lxml 2.2 validation question - lxml - The Python XML Toolkit

[lxml-dev] lxml 2.2 validation question

James Slagle

19 May 2009

2:21 a.m.

Hello,

I'm having some trouble getting lxml (v. 2.2) to validate an ElementTree object that I'm building, and I was hoping someone on the list could help and maybe tell me what I'm doing wrong.

If I create an ElementTree object directly from XML and an associated schema, it validates fine. If I then construct a similar ElementTree object by just instantiating ElementTree, it does not validate. The odd thing is that the resulting XML from etree.tostring is identical for both objects.

I've attached a Python script that shows the problem I'm having. The validation error is:

    *** DocumentInvalid: Element 'Foo': No matching global declaration available for the validation root.

I can get the second ElementTree object (etree2) to validate if I put the long explicit namespace in front of the tag value (Foo) when I create etree2 in the script. So, if I change line 25 in the script to:

    rootelem = etree.Element('{http://example.com}Foo', {}, nsmap)

it will validate.

However, the two resulting XML outputs are no longer equal because the output from etree2 is written with explicit namespaces.

Any ideas? Thanks for any help.

--
James Slagle

Attachments:

lxmltest.py (text/x-python — 913 bytes)
attachment.htm (text/html — 1.3 KB)


jholg＠gmx.de

19 May

9:06 a.m.

> I can get the second ElementTree object (etree2) to validate if I put the long explicit namespace in front of the tag value (Foo) when I create etree2 in the script. So, if I change line 25 in the script to:
>
>     rootelem = etree.Element('{http://example.com}Foo', {}, nsmap)
>
> it will validate.

And this is the right way to create an element that lives in namespace http://example.com. Some comments:

> nsmap={None: 'http://example.com', 'foo': 'http://example.com'}
> rootElem = etree.Element('Foo', {}, nsmap)

Note that this does not put Foo into the http://example.com NS. It creates an element Foo with no namespace. The nsmap is rather a collection of known namespace prefixes in the context of an element. So you could do

    >>> rootElem = etree.Element('{http://example.com}Foo', {}, {None: 'http://example.com'})
    >>> rootElem.text = '\nContents\n'
    >>> print etree.tostring(rootElem, pretty_print=True)
    <Foo xmlns="http://example.com">
    Contents
    </Foo>

    >>> print schemaObj.validate(rootElem)
    True

which puts Foo into the intended NS and uses this NS unprefixed in the output. But if you do this

    >>> rootElem = etree.Element('{http://example.com}Foo', {}, {None: 'http://example.com', 'foo': 'http://example.com'})
    >>> rootElem.text = '\nContents\n'
    >>> print schemaObj.validate(rootElem)
    True
    >>> print etree.tostring(rootElem, pretty_print=True)
    <foo:Foo xmlns:foo="http://example.com" xmlns="http://example.com">
    Contents
    </foo:Foo>

you end up with the foo prefix, probably because of the order in which a prefix for the NS http://example.com is found in the given nsmap (dictionaries are unordered).
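The distinction between the tag name and the nsmap can be checked directly. The following is a minimal sketch, not the attached lxmltest.py (which is not reproduced in the thread): the schema text and the names `SCHEMA`, `unqualified` and `qualified` are assumptions made for illustration.

```python
from lxml import etree

NS = "http://example.com"

# Assumed stand-in schema: one global element Foo in the target namespace.
SCHEMA = etree.XMLSchema(etree.fromstring(f"""\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="{NS}" elementFormDefault="qualified">
  <xs:element name="Foo" type="xs:string"/>
</xs:schema>
"""))

# The nsmap only declares prefixes; the tag 'Foo' itself carries no namespace.
unqualified = etree.Element("Foo", nsmap={None: NS})

# The namespace is part of the tag name, written in {uri}local form.
qualified = etree.Element(f"{{{NS}}}Foo", nsmap={None: NS})

for el in (unqualified, qualified):
    el.text = "\nContents\n"
    # Print the tag, the namespace lxml sees, and the validation result.
    print(repr(el.tag), etree.QName(el).namespace, SCHEMA.validate(el))
```

Inspecting `el.tag` or `etree.QName(el).namespace` is a quick way to see which namespace an element really lives in, independent of how it happens to serialize.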

> However, the 2 resulting xml outputs are no longer equal b/c the output from etree2 is output with explicit namespaces.

While textual equality is often dubious in XML :) you might clean up the superfluous namespaces:

    >>> rootElem = etree.Element('{http://example.com}Foo', nsmap={None: 'http://example.com'})
    >>> rootElem.text = '\nContents\n'
    >>> print etree.tostring(rootElem, pretty_print=True)
    <Foo xmlns="http://example.com">
    Contents
    </Foo>

    >>> etree1 = etree.fromstring("""\
    ... <Foo xmlns:foo="http://example.com"
    ... xmlns="http://example.com">
    ... Contents
    ... </Foo>
    ... """
    ... )
    >>> etree.tostring(etree1, pretty_print=True) == etree.tostring(rootElem, pretty_print=True)
    False
    >>> etree.cleanup_namespaces(etree1)
    >>> etree.tostring(etree1, pretty_print=True) == etree.tostring(rootElem, pretty_print=True)
    True

Or even consider canonicalization, see http://codespeak.net/lxml/api.html#write-c14n-on-elementtree

Holger

James Slagle

11:17 a.m.

On Tue, May 19, 2009 at 5:06 AM, <jholg@gmx.de> wrote:

> Some comments:
>
>> nsmap={None: 'http://example.com', 'foo': 'http://example.com'}
>> rootElem = etree.Element('Foo', {}, nsmap)
>
> Note that this does not put Foo into the http://example.com NS. It creates an element Foo with no namespace. The nsmap is rather a collection of known namespace prefixes in the context of an element.

Ok, that clears things up a bit and explains the validation error. I also see how I can use etree.cleanup_namespaces before comparing the xml output.

I guess the thing that I find strange is that there seems to be no way to end up with the xml I started with in my example if I instead start by instantiating an ElementTree object first. Either you have the validation error, fully prefixed namespaces, or one of the namespace declarations removed (if you were to use cleanup_namespaces).

Thanks for your help!

--
James Slagle

Piet van Oostrum

7:36 p.m.

>>>>> jholg@gmx.de (j) wrote:

j> Some comments:

>>> nsmap={None: 'http://example.com', 'foo': 'http://example.com'}
>>> rootElem = etree.Element('Foo', {}, nsmap)

j> Note that this does not put Foo into the http://example.com NS. It creates an element Foo with no namespace. The nsmap is rather a collection of known namespace prefixes in the context of an element.

But that means that the serialization:

    <Foo xmlns="http://example.com">
    Contents
    </Foo>

that etree.tostring produces is wrong.

--
Piet van Oostrum <piet@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet@vanoostrum.org

Stefan Behnel

20 May

7:35 a.m.

Piet van Oostrum wrote:

> jholg@gmx.de wrote:
>>> nsmap={None: 'http://example.com', 'foo': 'http://example.com'}
>>> rootElem = etree.Element('Foo', {}, nsmap)
>>
>> Note that this does not put Foo into the http://example.com NS. It creates an element Foo with no namespace. The nsmap is rather a collection of known namespace prefixes in the context of an element.
>
> But that means that the serialization:
>
>     <Foo xmlns="http://example.com"> Contents </Foo>
>
> that etree.tostring produces is wrong.

Correct. This is actually an error case that we could catch at the API level: the new element has no namespace *and* the nsmap defines a default namespace, i.e. this should fail:

    el = etree.Element('thetag', nsmap={None : 'uri:some-namespace'})

We'd then need to make sure that you can write

    el = etree.Element('{}thetag', nsmap={None : 'uri:some-namespace'})

which would result in something like

    <ns0:thetag xmlns="uri:some-namespace" xmlns:ns0="" />

Similarly, adding a namespace-less subelement to a tree that defines a default namespace will not work 'as expected' (whatever a user may expect in doing that). In this case, we'd have to cut the default namespace definition by inserting an xmlns="" on the new element, so this case would not be an error.

The same applies in the general case where you create a tree without namespaces and one that uses a default namespace, and then insert the namespace-less tree into the other one at some place.

Another sick case:

    el1 = etree.fromstring(
        '<p:qualified xmlns:p="uri:myns"><nons/></p:qualified>')
    el2 = etree.Element("{uri:tns}thetag", nsmap={None: "uri:otherns"})
    el2.append(el1)

So it looks like we'd have to integrate something similar into the namespace fixing mechanism... Ugly, ugly...

Stefan
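Until a check of this kind exists in the library, the first case can be approximated in user code. This is only a sketch: `checked_element` is a hypothetical wrapper, not part of lxml, and it covers only the "namespace-less tag plus default namespace in the nsmap" case, not the subelement and append cases described in the message.

```python
from lxml import etree


def checked_element(tag, nsmap=None, **attributes):
    """Hypothetical wrapper (not an lxml API): refuse a namespace-less tag
    when the nsmap declares a default namespace."""
    has_namespace = etree.QName(tag).namespace is not None
    if nsmap and None in nsmap and not has_namespace:
        raise ValueError(
            f"tag {tag!r} has no namespace, but nsmap declares the default "
            f"namespace {nsmap[None]!r}"
        )
    return etree.Element(tag, nsmap=nsmap, **attributes)


# A qualified tag passes through to etree.Element unchanged.
el = checked_element("{uri:some-namespace}thetag",
                     nsmap={None: "uri:some-namespace"})

# The ambiguous combination is rejected instead of serializing misleadingly.
try:
    checked_element("thetag", nsmap={None: "uri:some-namespace"})
except ValueError as exc:
    print(exc)
```

Raising early keeps the mismatch visible at the point where the element is created, rather than later as a validation error on output that looks correct.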

Stefan Behnel

19 May

5:44 p.m.

James Slagle wrote:

> I'm having some trouble getting lxml (v. 2.2) to validate an ElementTree object that I'm building and was hoping someone on the list could help and maybe tell me what I'm doing wrong.
>
> If I create an ElementTree object directly from xml and an associated schema, it will validate fine. If I then construct a similar ElementTree object by just instantiating ElementTree, it will not validate. The odd thing is that the resulting xml from etree.tostring for both objects is identical.
>
> I've attached a python script that shows the problem I'm having. The validation error is: *** DocumentInvalid: Element 'Foo': No matching global declaration available for the validation root.
>
> I can get the second ElementTree object (etree2) to validate if I put the long explicit namespace in front of the tag value (Foo) when I create etree2 in the script. So, if I change line 25 in the script to: rootelem = etree.Element('{http://example.com}Foo', {}, nsmap) , it will validate.
>
> However, the 2 resulting xml outputs are no longer equal b/c the output from etree2 is output with explicit namespaces.

With "explicit", do you mean that it uses namespace prefixes instead of the default namespace?

lxml.etree internally does some namespace cleanup on the fly and (re-)maps the namespaces of qualified tag names ("{abc}tag") to namespace prefixes depending on the place you insert an Element into a tree. Doing so, it will only use one namespace declaration for each mapping, even if you redeclare a namespace with more than one prefix. A side effect is that a namespace declaration may end up being unused if lxml finds a different declaration first.

Anyway, a few things to note here:

1) namespace prefixes are highly overrated

2) the default namespace is highly overused, especially when mixed with other (prefixed) namespaces

3) it is rarely (not 'never', but 'rarely') useful to declare the same namespace more than once.

4) comparing textual representations of XML documents is futile most of the time, except for their C14N serialisation.

Stefan
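Point 4 can be illustrated with a small comparison. The sketch below assumes an lxml release in which `etree.tostring` accepts `method="c14n"` together with `exclusive=True`; exclusive mode is chosen on the assumption that it drops namespace declarations an element does not use, which is what matters for the redundant `foo` prefix here. The variable and function names are illustrative.

```python
from lxml import etree

# Two documents that differ only in a redundant prefix declaration.
with_extra_prefix = etree.fromstring(
    '<Foo xmlns:foo="http://example.com" xmlns="http://example.com">'
    'Contents</Foo>'
)
default_only = etree.fromstring(
    '<Foo xmlns="http://example.com">Contents</Foo>'
)


def canonical(element):
    # Exclusive C14N: serialize with only the namespaces that are in use.
    return etree.tostring(element, method="c14n", exclusive=True)


# Plain serialization keeps the unused declaration, so the bytes differ.
plain_equal = etree.tostring(with_extra_prefix) == etree.tostring(default_only)

# The canonical forms are compared instead of the raw text.
c14n_equal = canonical(with_extra_prefix) == canonical(default_only)

print(plain_equal, c14n_equal)
```

Unlike `etree.cleanup_namespaces`, this leaves both trees unmodified and only changes what is compared.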

James Slagle

6:18 p.m.

On Tue, May 19, 2009 at 1:44 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:

> With "explicit", do you mean that it uses namespace prefixes instead of the default namespace?

Yes, exactly. So in my example, it is output as <foo:Foo>, instead of just <Foo>.

> lxml.etree internally does some namespace cleanup on the fly and (re-)maps the namespaces of qualified tag names ("{abc}tag") to namespace prefixes depending on the place you insert an Element into a tree. Doing so, it will only use one namespace declaration for each mapping, even if you redeclare a namespace with more than one prefix. A side effect is that a namespace declaration may end up being unused if lxml finds a different declaration first.
>
> Anyway, a few things to note here:
>
> 1) namespace prefixes are highly overrated
> 2) the default namespace is highly overused, especially when mixed with other (prefixed) namespaces
> 3) it is rarely (not 'never', but 'rarely') useful to declare the same namespace more than once.

My main issue is with an external tool I'm passing my generated xml to. This tool expects no prefixes on the elements, but prefixes on the attributes, and thus needs the namespace declared with a prefix and as the default. Yes, I know this is broken, and the tool needs to be fixed to be more flexible :).

I was mainly wanting to know if it was possible to use lxml to generate xml output in this manner.

Thanks.

--
James Slagle

Stefan Behnel

20 May

7:47 a.m.

James Slagle wrote:

> My main issue is with an external tool I'm passing my generated xml to. This tool expects no prefixes on the elements, but prefixes on the attributes, and thus needs the namespace declared with a prefix and as the default. Yes, I know this is broken, and the tool needs to be fixed to be more flexible :).

... "to support XML namespaces", you mean. ;-)

> I was mainly wanting to know if it was possible to use lxml to generate xml output in this manner.

I recall adding a namespace setup rule that explicitly prefers the prefixed namespace over an equivalent default namespace when you define both on the same node. This is because otherwise you end up with a similar problem with unnamespaced attributes on namespaced elements.

The output you get is a side effect of that fix, so, no, there isn't a way to define a namespace as both the default namespace and a prefixed namespace, and basically let lxml ignore the prefixed namespace in favour of the default.

That said, was there actually a reason why you defined the namespace prefix in the first place? Why isn't the default namespace enough to do what you want? (Note that the prefix used in the schema is independent of the one used in the document, that's what I meant with prefixes being 'overrated'.)

Stefan
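The attribute problem mentioned here comes from the XML namespaces rule that a default namespace never applies to attributes, so a namespaced attribute always needs some prefix. A rough sketch of what happens when only the default namespace is declared follows; the attribute name `bar` is hypothetical, the prefix lxml picks for it is left to the library, and whether the result suits the external tool is not something this example can show.

```python
from lxml import etree

NS = "http://example.com"

# Only the default namespace is declared; no 'foo' prefix is requested.
root = etree.Element(f"{{{NS}}}Foo", nsmap={None: NS})

# Hypothetical namespaced attribute, for illustration only.
root.set(f"{{{NS}}}bar", "value")

serialized = etree.tostring(root)
print(serialized.decode())

# Re-parse to check what the output means, independent of prefix choices.
reparsed = etree.fromstring(serialized)
print(reparsed.tag, dict(reparsed.attrib))
```

Checking the re-parsed tag and attribute names, rather than the literal prefixes, is in line with the point that the prefix itself carries no meaning.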

James Slagle

7:27 p.m.

On Wed, May 20, 2009 at 3:47 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:

> That said, was there actually a reason why you defined the namespace prefix in the first place? Why isn't the default namespace enough to do what you want? (Note that the prefix used in the schema is independent of the one used in the document, that's what I meant with prefixes being 'overrated'.)

There was no good reason. I was merely trying to match the example xml that I have for working with this particular tool.

Thanks for the helpful information, it has certainly helped clear things up for me.

--
James Slagle
