lxml not ignoring whitespace when validating xsd:int
Hello,

I have already asked this on Stack Overflow, but figured I had better ask here as well.

The following Python script contains a simple XML schema defining an element 'a' of integer type and an XML document containing such an element. When validating the document against the schema, the validation fails.

---%<---
from lxml import etree
from StringIO import StringIO

xmlschema = etree.XMLSchema(etree.parse(StringIO('''\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="a" type="xsd:int"/>
</xsd:schema>
''')))

xmldoc = etree.parse(StringIO("<a> 42</a>"))

print xmlschema.validate(xmldoc)
--->%---

According to XML Schema Part 2: Datatypes Second Edition, section 4.3.6, all atomic data types other than 'string' have their 'whiteSpace' constraint set to 'collapse', so I think the element 'a' should be valid.

Am I mistaken or is this a bug? I have found a similar question on Stack Overflow regarding the atomic type dateTime, which unfortunately has no solution so far.

Regards,
Markus
Hi,
From: Markus Schöpflin <markus.schoepflin@comsoft.aero>
[...]
> The following Python script contains a simple XML schema defining an element 'a' of integer type and an XML document containing such an element. When validating the document against the schema, the validation fails.
>
> ---%<---
> from lxml import etree
> from StringIO import StringIO
>
> xmlschema = etree.XMLSchema(etree.parse(StringIO('''\
> <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
> <xsd:element name="a" type="xsd:int"/>
> </xsd:schema>
> ''')))
>
> xmldoc = etree.parse(StringIO("<a> 42</a>"))
>
> print xmlschema.validate(xmldoc)
> --->%---
>
> According to XML Schema Part 2: Datatypes Second Edition, section 4.3.6, all atomic data types other than 'string' have their 'whiteSpace' constraint set to 'collapse', so I think the element 'a' should be valid.
>
> Am I mistaken or is this a bug? I have found a similar question on Stack Overflow regarding the atomic type dateTime, which unfortunately has no solution so far.
My interpretation of the XML Schema Rec is also that this should be valid. FWIW, oXygen validates it just fine using Xerces or Saxon for validation.

libxml2 bug, I'd say.

Funny that this hasn't bitten people more often. It looks like creators of XML documents are more disciplined than you'd think (wrt this little insanity), after all ;-)

Unfortunately, this can't easily be worked around by using a whitespace-ignoring parser, as such (leaf element) whitespace is not deemed ignorable.

Of course you could loop over your tree and strip leading and trailing whitespace from your (leaf) elements after parsing and before validation, if it's safe for you to do so.

Unrelated: You could use etree.fromstring() instead of parse() here, so there is no need for StringIO.

Holger
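The strip-before-validation idea could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name strip_leaf_whitespace is made up, it modifies the tree in place, and it assumes that leading and trailing whitespace in leaf elements carries no meaning for your data. It is written for Python 3 and uses etree.fromstring() rather than parse() with StringIO.

```python
from lxml import etree

SCHEMA = b"""\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="xsd:int"/>
</xsd:schema>
"""


def strip_leaf_whitespace(root):
    """Strip leading/trailing XML whitespace from leaf element text, in place.

    Hypothetical helper; only safe if that whitespace is insignificant.
    """
    for element in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(element.tag, str):
            continue
        # A leaf element has no children, so only its .text can hold content.
        if len(element) == 0 and element.text is not None:
            element.text = element.text.strip(" \t\r\n")


xmlschema = etree.XMLSchema(etree.fromstring(SCHEMA))
root = etree.fromstring(b"<a> 42</a>")

strip_leaf_whitespace(root)
result = xmlschema.validate(root)
print(result)
```

Note that the parsed tree no longer matches the original document afterwards, so this only fits cases where the stripped text is what you want to keep working with.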
On 19.03.2014 10:35, Holger Joukl wrote:
[...]
> > According to XML Schema Part 2: Datatypes Second Edition, section 4.3.6, all atomic data types other than 'string' have their 'whiteSpace' constraint set to 'collapse', so I think the element 'a' should be valid.
> > Am I mistaken or is this a bug? I have found a similar question on Stack Overflow regarding the atomic type dateTime, which unfortunately has no solution so far.
> My interpretation of the XML Schema Rec is also that this should be valid. FWIW, oXygen validates it just fine using Xerces or Saxon for validation.
> libxml2 bug, I'd say.
I even found the following comment in the libxml2 sources (xmlschemas.c / xmlSchemaTypeFixupWhitespace()):

    /*
    * For all `atomic` datatypes other than string (and types `derived`
    * by `restriction` from it) the value of whiteSpace is fixed to
    * collapse
    */

This reads like it should actually work. I will raise a bug for this.
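For reference, the 'collapse' behaviour that the comment refers to amounts to the normalization below, shown in Python purely as an illustration of the rule in the XML Schema Rec. It is not libxml2's code, and collapse_whitespace is a made-up name.

```python
import re


def collapse_whitespace(value):
    """Apply the XML Schema whiteSpace='collapse' rule to a lexical value."""
    # "replace" step: each tab, line feed and carriage return becomes a space.
    value = re.sub(r"[\t\n\r]", " ", value)
    # "collapse" step: squeeze runs of spaces into one, then trim both ends.
    return re.sub(r" +", " ", value).strip(" ")


lexical = collapse_whitespace(" 42")
print(repr(lexical), int(lexical))
```

Under this rule the content of `<a> 42</a>` is normalized before it is checked against xsd:int, which is why the element is expected to be valid.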
> Funny that this hasn't bitten people more often. It looks like creators of XML documents are more disciplined than you'd think (wrt this little insanity), after all ;-)
> Unfortunately, this can't easily be worked around by using a whitespace-ignoring parser, as such (leaf element) whitespace is not deemed ignorable.
> Of course you could loop over your tree and strip leading and trailing whitespace from your (leaf) elements after parsing and before validation, if it's safe for you to do so.
I still need the tree afterwards, so I don't think this is an option.
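If the original tree has to stay untouched, one possibility would be to validate a stripped copy instead. This is a minimal sketch under assumptions, not something proposed in the thread: validate_ignoring_leaf_whitespace is a hypothetical name (not an lxml function), and it assumes the document is small enough that a deep copy is acceptable.

```python
import copy

from lxml import etree


def validate_ignoring_leaf_whitespace(xmlschema, tree):
    """Validate a whitespace-stripped deep copy; `tree` itself is not modified."""
    clone = copy.deepcopy(tree)
    for element in clone.iter():
        # Only touch real elements (string tag) without children that have text.
        if isinstance(element.tag, str) and len(element) == 0 and element.text:
            element.text = element.text.strip(" \t\r\n")
    return xmlschema.validate(clone)


xmlschema = etree.XMLSchema(etree.fromstring(b"""\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="xsd:int"/>
</xsd:schema>
"""))
xmldoc = etree.fromstring(b"<a> 42</a>")

is_valid = validate_ignoring_leaf_whitespace(xmlschema, xmldoc)
print(is_valid, repr(xmldoc.text))
```

The trade-off is that any positions reported in the validator's error log refer to the copy rather than to the tree you keep.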
> Unrelated: You could use etree.fromstring() instead of parse() here, so there is no need for StringIO.
Thanks for the hint.

Markus
participants (2)
- Holger Joukl
- Markus Schöpflin