Philip Zerull wrote on 25.04.2017 at 17:33:
> So I have to validate very large xml files (gigabytes large) against xsd
> schema files. I have been asked to produce line and column numbers where
> the errors occurred in the file.
>
> I'm using:
>
> Python 3.5.2
> lxml 3.7.3
> libxml2 2.9.3+dfsg1-1ubuntu0.2
> Ubuntu 16.04
>
> The following code does a good job of getting a list of all the issues with
> the xml file that I can nicely present to the user (for brevity I left out
> a bunch of try/except statements). This works great because it also gives
> the line and column numbers in the xml file where the issues occurred:
>
> from lxml import etree
>
> schema_root = etree.XML(xsd_file.read())
> schema = etree.XMLSchema(schema_root)
> parsed_xml = etree.XML(xml_file.read())
> if not schema.validate(parsed_xml):
>     # error_log entries carry .line, .column and .message
>     errors_list = schema.error_log
>
> Unfortunately, this results in reading the entire file in memory.

Actually twice: once as a byte string (xml_file.read()), and then again as
an XML tree (XML()). Use lxml.etree.parse() instead to avoid the wasteful
first step.
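
A minimal sketch of that, assuming the schema and the document are available
as file-like objects. The sample XSD, the sample document and the helper name
collect_schema_errors are made up for illustration. The tree is still built in
memory, but the extra byte-string copy is gone:

```python
import io

from lxml import etree

# Hypothetical sample data, only here so the sketch runs on its own.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:int" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

XML_DOC = b"""<root>
  <item>1</item>
  <item>oops</item>
  <item>also bad</item>
</root>"""


def collect_schema_errors(xml_source, xsd_source):
    """Return (line, column, message) for every schema violation found."""
    # parse() reads straight from the file object, so no intermediate
    # byte string of the whole document is created.
    schema = etree.XMLSchema(etree.parse(xsd_source))
    tree = etree.parse(xml_source)
    if schema.validate(tree):
        return []
    return [(e.line, e.column, e.message) for e in schema.error_log]


errors = collect_schema_errors(io.BytesIO(XML_DOC), io.BytesIO(XSD))
for line, column, message in errors:
    print(line, column, message)
```

With real files, pass the open file objects (or the file names) in place of
the io.BytesIO wrappers.
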
> Alternatively, I can do the following:
>
> schema_root = etree.XML(xsd_file.read())
> schema = etree.XMLSchema(schema_root)
> for event, elem in etree.iterparse(xml_file, schema=schema):
>     # clear stuff from memory
>     elem.clear()
>     parent = elem.getparent()
>     while elem.getprevious() is not None:
>         del parent[0]
>
>
> If the xml fails to validate against the xsd file, then the iterator
> returned by etree.iterparse raises an etree.XMLSyntaxError from its next
> method at the first xsd issue it finds.
>
> This does a great job of keeping the memory usage constant even over very
> large files, but it only tells me whether the xml file is valid against
> the xsd or not. It does not give me a list of issues like the previously
> mentioned method does.

Try passing recover=True to iterparse(). That will make it continue parsing
as long as it can. I'm not sure right now what the exact interaction with
validation is, though.
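
An untested-in-anger sketch of what that could look like. The sample schema,
the sample document and the helper name iter_validate are invented for
illustration. It assumes that whatever the parser reports ends up in the
error_log of the iterparse object; whether that covers every schema issue in
recover mode, and whether an exception is still raised at the end, is exactly
the part that would need checking:

```python
import io

from lxml import etree

# Hypothetical sample data, only here so the sketch runs on its own.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:int" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

INVALID_XML = b"""<root>
  <item>1</item>
  <item>oops</item>
  <item>also bad</item>
</root>"""

SCHEMA = etree.XMLSchema(etree.XML(XSD))


def iter_validate(xml_source, schema):
    """Stream xml_source and return the logged (line, column, message) tuples."""
    context = etree.iterparse(xml_source, schema=schema, recover=True)
    try:
        for _event, elem in context:
            # Same memory clean-up as in the original loop.
            elem.clear()
            parent = elem.getparent()
            if parent is not None:
                while elem.getprevious() is not None:
                    del parent[0]
    except etree.XMLSyntaxError:
        # The parser may still raise, possibly only at the end of the
        # input; whatever was logged up to that point is read below.
        pass
    return [(e.line, e.column, e.message) for e in context.error_log]


errors = iter_validate(io.BytesIO(INVALID_XML), SCHEMA)
for line, column, message in errors:
    print(line, column, message)
```

If the log turns out to contain only the first issue, the whole-tree
validation remains the fallback for a complete list.
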
Stefan
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml