lxml.etree.iterparse schema validation does not validate

Hi,

I'm not sure whether I am using the iterparse interface correctly, but I had a look at its tests (src/lxml/tests/test_xmlschema.py) and found that validating with iterparse and a schema gives a different result than calling schema.validate or schema.assertValid on a tree parsed with etree.parse.

Information that is probably needed to reproduce this (run on Xubuntu 18.04, with and without lxml manually updated via pip install -U lxml):

Python 2.7.17 (default, Jul 20 2020, 15:37:01)
[GCC 7.5.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> from lxml import etree
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python              : sys.version_info(major=2, minor=7, micro=17, releaselevel='final', serial=0)
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree          : (4, 5, 2, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used         : (2, 9, 10)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled     : (2, 9, 10)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used        : (1, 1, 34)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled    : (1, 1, 34)

The minimal script to reproduce this (based on your test test_xmlschema_iterparse_fail in src/lxml/tests/test_xmlschema.py):

    from lxml import etree
    from io import BytesIO
    import StringIO

    schema = etree.parse(StringIO.StringIO('''
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="a" type="AType"/>
      <xsd:complexType name="AType">
        <xsd:sequence>
          <xsd:element name="b" type="xsd:string" />
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>
    '''))
    schema = etree.XMLSchema(schema)

    raw_tree = BytesIO('<a><c></c></a>')
    etree.iterparse(raw_tree, schema=schema)
    tree = etree.parse(raw_tree)
    if not schema.validate(tree):
        print('Error: Different validation results:')
        schema.assertValid(tree)

After some more research I found that the test test_xmlschema_iterparse_fail also seems to be broken. Below is a version that I changed to use "with self.assertRaises", so that an expected exception that is not raised also makes the test fail. With it you can reproduce the bug (if you agree that it is one) within your test framework, because the iterparse call raises no exception:

    def test_xmlschema_iterparse_fail(self):
        schema = self.parse('''
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="a" type="AType"/>
      <xsd:complexType name="AType">
        <xsd:sequence>
          <xsd:element name="b" type="xsd:string" />
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>
    ''')
        schema = etree.XMLSchema(schema)
        with self.assertRaises(etree.XMLSyntaxError):
            etree.iterparse(BytesIO('<a><c></c></a>'), schema=schema)

Running make test then gives something like:

    FAIL: test_xmlschema_iterparse_fail (lxml.tests.test_xmlschema.ETreeXMLSchemaTestCase)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/usr/lib/python2.7/unittest/case.py", line 329, in run
        testMethod()
      File "/home/user/git/lxml/src/lxml/tests/test_xmlschema.py", line 289, in test_xmlschema_iterparse_fail
        etree.iterparse(BytesIO('<a><c></c></a>'), schema=schema)
      File "/usr/lib/python2.7/unittest/case.py", line 116, in __exit__
        "{0} not raised".format(exc_name))
    AssertionError: XMLSyntaxError not raised

Could you advise me what to do next? Should I file a bug for this? Or is my expectation wrong that iterparse should raise an exception on a schema violation?

Best regards,
Kai

Kai Hillmann wrote on 28.07.20 at 10:58:
> I'm not sure whether I am using the iterparse interface correctly, but I had a look at its tests (src/lxml/tests/test_xmlschema.py) and found that validating with iterparse and a schema gives a different result than calling schema.validate or schema.assertValid on a tree parsed with etree.parse.
>
>     schema = etree.XMLSchema(schema)
>     raw_tree = BytesIO('<a><c></c></a>')
>     etree.iterparse(raw_tree, schema=schema)
>     tree = etree.parse(raw_tree)
>     if not schema.validate(tree):
>         print('Error: Different validation results:')
>         schema.assertValid(tree)

Note that etree.iterparse() returns an iterator that parses incrementally. Creating it does not parse any of the input by itself. To trigger the parsing (and thus any parsing or validation errors), you have to iterate over it. That is what the call to list() in the test function is for.

Stefan
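
To illustrate the point above, here is a minimal sketch in Python 3 syntax (the original report used Python 2.7). The names SCHEMA_XML, INVALID_DOC and the two result variables are illustrative only and do not come from the lxml test suite. It assumes, as described above, that creating the iterparse object raises nothing and that the XMLSyntaxError only appears once the iterator is consumed:

```python
from io import BytesIO

from lxml import etree

# Same schema as in the report: <a> must contain exactly one <b> child.
SCHEMA_XML = b'''<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="AType"/>
  <xsd:complexType name="AType">
    <xsd:sequence>
      <xsd:element name="b" type="xsd:string" />
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>'''

# Violates the schema: <c> appears where <b> is expected.
INVALID_DOC = b'<a><c></c></a>'

schema = etree.XMLSchema(etree.XML(SCHEMA_XML))

# Creating the iterator does not parse anything yet, so no error here.
events = etree.iterparse(BytesIO(INVALID_DOC), schema=schema)

# Consuming the iterator drives the parser; the schema violation then
# surfaces as an XMLSyntaxError.
try:
    list(events)
    iterparse_valid = True
except etree.XMLSyntaxError as exc:
    iterparse_valid = False
    print("iterparse rejected the document:", exc)

# Validating a fully parsed tree, for comparison.
tree_valid = schema.validate(etree.parse(BytesIO(INVALID_DOC)))

print("iterparse valid:", iterparse_valid)
print("tree valid:", tree_valid)
```

Each parse gets its own BytesIO here, so neither call depends on how much of the stream the other one has already read.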

On 28.07.20 at 13:05, Stefan Behnel wrote:
> Kai Hillmann wrote on 28.07.20 at 10:58:
>> I'm not sure whether I am using the iterparse interface correctly, but I had a look at its tests (src/lxml/tests/test_xmlschema.py) and found that validating with iterparse and a schema gives a different result than calling schema.validate or schema.assertValid on a tree parsed with etree.parse.
>>
>>     schema = etree.XMLSchema(schema)
>>     raw_tree = BytesIO('<a><c></c></a>')
>>     etree.iterparse(raw_tree, schema=schema)
>>     tree = etree.parse(raw_tree)
>>     if not schema.validate(tree):
>>         print('Error: Different validation results:')
>>         schema.assertValid(tree)
>
> Note that etree.iterparse() returns an iterator that parses incrementally. Creating it does not parse any of the input by itself. To trigger the parsing (and thus any parsing or validation errors), you have to iterate over it. That is what the call to list() in the test function is for.

Thank you very much for the quick response. I overlooked that. You are right: when I iterate over the iterator returned by iterparse, validation works as expected.

Kai
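
For reference, a hedged sketch of how the modified test could look once the iterator is actually consumed. This is written as a standalone unittest case in Python 3 syntax rather than as part of lxml's own test classes; the class name IterparseSchemaTest, the constant SCHEMA_XML and the second test method are assumptions added for illustration:

```python
import unittest
from io import BytesIO

from lxml import etree

# Same schema as in the report: <a> must contain exactly one <b> child.
SCHEMA_XML = b'''<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="AType"/>
  <xsd:complexType name="AType">
    <xsd:sequence>
      <xsd:element name="b" type="xsd:string" />
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>'''


class IterparseSchemaTest(unittest.TestCase):
    """Hypothetical standalone test case, not part of the lxml test suite."""

    def setUp(self):
        self.schema = etree.XMLSchema(etree.XML(SCHEMA_XML))

    def test_invalid_document_raises_on_iteration(self):
        events = etree.iterparse(BytesIO(b'<a><c></c></a>'),
                                 schema=self.schema)
        # list() consumes the iterator, which is what triggers parsing
        # and validation.
        with self.assertRaises(etree.XMLSyntaxError):
            list(events)

    def test_valid_document_yields_events(self):
        events = etree.iterparse(BytesIO(b'<a><b></b></a>'),
                                 schema=self.schema)
        # Only 'end' events are reported by default.
        tags = [(event, element.tag) for event, element in events]
        self.assertEqual([('end', 'b'), ('end', 'a')], tags)
```

Saved as a module, this can be run with python -m unittest.
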
> Stefan
> _________________________________________________________________
> Mailing list for the lxml Python XML toolkit - http://lxml.de/
> lxml@lxml.de
> https://mailman-mail5.webfaction.com/listinfo/lxml

Participants (2): Kai Hillmann, Stefan Behnel