Philip Zerull wrote on 25.04.2017 at 17:33:
So I have to validate very large XML files (gigabytes large) against XSD schema files. I have been asked to report the line and column numbers where the errors occur in the file.
I'm using:
Python 3.5.2, lxml 3.7.3, libxml2 2.9.3+dfsg1-1ubuntu0.2, Ubuntu 16.04
The following code does a good job of collecting a list of all the issues in the XML file, which I can then present nicely to the user (for brevity I left out a bunch of try/except blocks). This works great because it also gives the line and column numbers in the XML file where the issues occurred:
    schema_root = etree.XML(xsd_file.read())
    schema = etree.XMLSchema(schema_root)
    parser = etree.XMLParser(schema=schema)
    parsed_xml = etree.XML(xml_file.read())
    if not schema.validate(parsed_xml):
        errors_list = schema.error_log
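Each entry in the schema's error log exposes .line, .column and .message attributes, which is where the reported positions come from. A minimal self-contained sketch (the tiny schema and document below are made up purely for illustration):

```python
from io import BytesIO
from lxml import etree

# Made-up schema and document, just to have something that fails validation.
xsd = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:int"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

xml = b"""<root>
  <item>not-a-number</item>
</root>"""

schema = etree.XMLSchema(etree.XML(xsd))
doc = etree.parse(BytesIO(xml))
if not schema.validate(doc):
    for error in schema.error_log:
        # each log entry records where the problem was found
        print(error.line, error.column, error.message)
```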
Unfortunately, this results in reading the entire file in memory.
Actually twice, once as a byte sequence (xml_file.read()), and then as an XML tree (XML()). Use lxml.etree.parse() instead to avoid the wasteful first step.
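In sketch form, the suggested change looks like this (the temp-file setup only exists to make the example runnable; in practice the schema and data files already exist on disk):

```python
import tempfile
from pathlib import Path
from lxml import etree

# Throwaway schema and document written to disk for the sketch.
tmp = Path(tempfile.mkdtemp())
(tmp / "schema.xsd").write_bytes(
    b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root" type="xs:int"/>
</xs:schema>""")
(tmp / "data.xml").write_bytes(b"<root>42</root>")

# parse() reads incrementally from the file -- no intermediate bytes
# object from .read() sitting in memory alongside the parsed tree.
schema = etree.XMLSchema(etree.parse(str(tmp / "schema.xsd")))
doc = etree.parse(str(tmp / "data.xml"))
print(schema.validate(doc))
```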
Alternatively, I can do the following:
    schema_root = etree.XML(xsd_file.read())
    schema = etree.XMLSchema(schema_root)
    for event, elem in etree.iterparse(xml_file, schema=schema):
        # clear already-processed elements from memory
        elem.clear()
        parent = elem.getparent()
        while elem.getprevious() is not None:
            del parent[0]
If the XML fails to validate against the XSD file, the iterator returned by etree.iterparse() raises an etree.XMLSyntaxError (from its next() method) for the first issue it finds.
This does a great job of keeping the memory usage constant even over very large files, but it only tells me whether the XML file is valid or not. It does not give me a list of issues like the previously mentioned method does.
Try passing recover=True to iterparse(). That will make it continue parsing for as long as it can. I'm not sure right now what the exact interaction with validation is, though.

Stefan