Returning Multiple Schema Errors when Using Iterparse
Hi everyone, So I have to validate vary large xml files (gigabytes large) against xsd schema files. I have been asked to produce line and column numbers where the errors occurred in the file. I'm using: Python 3.5.2 lxml 3.7.3 libxml2 2.9.3+dfsg1-1ubuntu0.2 Ubuntu 16.04 The following code does a good job of getting a list of all the issues with the xml file that I can nicely present to the user (for brevity I left out a bunch of try/catch statements). This works great because it also gives the line and column numbers in the xml file where the issues occurred: schema_root = etree.XML(xsd_file.read()) schema = etree.XMLSchema(schema_root) parser = etree.XMLParser(schema=schema) parsed_xml = etree.XML(xml_file.read()) if not schema.validate(parsed_xml): errors_list = schema.errors_log Unfortunately, this results in reading the entire file in memory. Alternatively, I can do the following: schema_root = etree.XML(xsd_file.read()) schema = etree.XMLSchema(schema_root) for event, elem in etree.iterparse(xml_file, schema=schema): #clear stuff from memory elem.clear() parent = elem.getparent() while elem.getprevious() is not None: del elem.getparent()[0] If the xml fails to validate against the xsd file then the iterator returned by etree.iterparse's next method raises an etree.XMLSyntaxError for the first xsd issue it finds. This does a great job of keeping the memory usage constant over even very large files, but it only tells me if the xsd file is valid or not. It does not give me a list of issues like the previously mentioned method does. I tried just catching the XMLSyntaxError by surrounding only the iterator's next method with a try/catch and attempting to move on with the file even if the error was raised, but that just resulted in an infinite loop where the iterator's next method continuously raised the same error for the same element and stopped progressing through the file. 
It's very tempting to think that there must be a way to get the best of both worlds here, but I've scoured the documentation, tried a few things, and briefly dug into the code, without any luck so far. I'm not even sure this is possible with iterparse. Any advice in this area would be greatly appreciated.

Regards,
Phil Zerull
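For concreteness, the whole-tree approach described above can be run end to end as in the following minimal sketch. The inline schema and document are invented purely for illustration; each entry in the error log exposes line, column, and message attributes, which is exactly the positional information requested:

```python
from lxml import etree

# Invented schema/document pair for illustration: every "item" must be an integer.
xsd = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:integer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

xml = b"""<root>
<item>1</item>
<item>not-a-number</item>
<item>oops</item>
</root>"""

schema = etree.XMLSchema(etree.XML(xsd))
tree = etree.XML(xml)

errors = []
if not schema.validate(tree):
    # Whole-tree validation collects every violation, each with its source position.
    for entry in schema.error_log:
        errors.append((entry.line, entry.column, entry.message))

for line, column, message in errors:
    print(line, column, message)
```

Both invalid items are reported, not just the first, which is what makes this approach attractive despite its memory cost.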
Philip Zerull schrieb am 25.04.2017 um 17:33:
So I have to validate very large XML files (gigabytes large) against XSD schema files. I have been asked to produce line and column numbers where the errors occurred in the file.
I'm using:
    Python 3.5.2
    lxml 3.7.3
    libxml2 2.9.3+dfsg1-1ubuntu0.2
    Ubuntu 16.04
The following code does a good job of getting a list of all the issues with the xml file that I can nicely present to the user (for brevity I left out a bunch of try/catch statements). This works great because it also gives the line and column numbers in the xml file where the issues occurred:
    schema_root = etree.XML(xsd_file.read())
    schema = etree.XMLSchema(schema_root)
    parser = etree.XMLParser(schema=schema)
    parsed_xml = etree.XML(xml_file.read())
    if not schema.validate(parsed_xml):
        errors_list = schema.error_log
Unfortunately, this results in reading the entire file in memory.
Actually twice, once as a byte sequence (xml_file.read()), and then as an XML tree (XML()). Use lxml.etree.parse() instead to avoid the wasteful first step.
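That streamed variant might look like the following sketch; the BytesIO objects stand in for real files here, since parse() accepts both paths and file-like objects:

```python
import io
from lxml import etree

# parse() lets the parser consume the input itself, so the document is
# held once, as a tree, never additionally as a Python bytes object.
xsd_file = io.BytesIO(b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="item" type="xs:integer"/>
</xs:schema>""")
xml_file = io.BytesIO(b"<item>42</item>")

schema = etree.XMLSchema(etree.parse(xsd_file))
tree = etree.parse(xml_file)
ok = schema.validate(tree)
```

This still builds the full tree in memory, so it halves the overhead rather than removing it; the streaming question below is a separate problem.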
Alternatively, I can do the following:
    schema_root = etree.XML(xsd_file.read())
    schema = etree.XMLSchema(schema_root)
    for event, elem in etree.iterparse(xml_file, schema=schema):
        # clear processed elements from memory
        elem.clear()
        parent = elem.getparent()
        while elem.getprevious() is not None:
            del parent[0]
If the XML fails to validate against the XSD file, the iterator returned by etree.iterparse raises an etree.XMLSyntaxError from its next method for the first schema issue it finds.
This does a great job of keeping memory usage constant even over very large files, but it only tells me whether the XML file is valid or not. It does not give me a list of issues like the previously mentioned method does.
Try passing recover=True to iterparse(). That will make it continue parsing as long as it can. Not sure right now what the exact interaction with validation is, though.

Stefan
Thanks Stefan,

I tried passing recover=True, but as soon as it hit the first schema error, each successive call to the iterator's next method raised the same error over and over, and the iterator did not proceed to the next element. Below is the code I used:

    schema_root = etree.XML(xsd_file.read())
    schema = etree.XMLSchema(schema_root)
    iterator = etree.iterparse(xml_file, schema=schema, recover=True)
    error_list = []
    while True:
        try:
            event, elem = next(iterator)
            elem.clear()
            parent = elem.getparent()
            while elem.getprevious() is not None:
                del parent[0]
        except StopIteration:
            break
        except etree.XMLSyntaxError as err:
            print(elem)
            print(err)
            print('---------------------------')
            error_list.append(err)
            if len(error_list) >= 100:
                break

Regards,
Phil Zerull

On Mon, May 1, 2017 at 10:06 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
_________________________________________________________________ Mailing list for the lxml Python XML toolkit - http://lxml.de/ lxml@lxml.de https://mailman-mail5.webfaction.com/listinfo/lxml
participants (2)

- Philip Zerull
- Stefan Behnel