Incremental parsing of XML

Hello,
I've run up against a problem and I've tried to create a "minimal" example to demonstrate. What I'm trying to achieve is to incrementally parse xml files but still use xpath expressions to extract "records" from the xml. A user would provide an xml file, a "record xpath" and a set of "column xpaths". My plan was to read the xml file incrementally by only reading "a" first child node of the root, extract records from that node, discard the node and continue with the next "first child". This works fine for most of the xml files we're importing, except when they get really big. Of course, the whole point of this exercise is to be able to parse *big* files without running out of memory. The problem I'm seeing is that *some* nodes aren't loaded correctly (they are "empty"), but I can't figure out why.
python 3.6.5, lxml 4.3.0.0, libxml 2.9.9, libxslt 1.1.32
If I run the following script:
------8<-----------------------------------------------------
from lxml import etree as ET


def create_big_xml(file_path, node_count=1000):
    with ET.xmlfile(file_path, encoding='utf-8') as fh:
        fh.write_declaration()
        with fh.element('root'):
            for i in range(node_count):
                c0 = ET.Element('c0')
                c1 = ET.SubElement(c0, 'c1')
                c2 = ET.SubElement(c1, 'c2')
                c2.text = str(i)
                fh.write(c0)


def iter_xml_records(file_path, record_xpath, column_xpaths):
    doc = ET.iterparse(file_path, events=('start', 'end'))
    try:
        _, root = next(doc)
    except StopIteration:
        return
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            for record_node in record_xpath(root):
                yield [xp(record_node) for xp in column_xpaths]
            start_tag = None
            root.clear()


fp = 'test-incr.xml'
errors = 0
node_count = 100
while not errors:
    node_count *= 10
    create_big_xml(fp, node_count)

    rec_xp = ET.XPath('./c0')
    col_xps = [
        ET.XPath('./c1/c2/text()'),
    ]
    for i, record in enumerate(iter_xml_records(fp, rec_xp, col_xps)):
        if record != [[str(i)]]:
            errors += 1

    print(node_count, "nodes:", errors, "errors")
------8<-----------------------------------------------------

The output on my machine is:
python bug_lxml.py
1000 nodes: 0 errors
10000 nodes: 5 errors
If I comment out the line "fh.write_declaration()", it takes longer to produce errors:
python bug_lxml.py
1000 nodes: 0 errors
10000 nodes: 0 errors
100000 nodes: 0 errors
1000000 nodes: 411 errors
If you're still with me, thanks for taking the time! Is this a bug in lxml, or am I doing something that I shouldn't?
Regards, Wietse

Wietse Jacobs wrote on 25.01.19 at 17:44:
> [...]
> ------8<-----------------------------------------------------
> [...]
> def iter_xml_records(file_path, record_xpath, column_xpaths):
>     doc = ET.iterparse(file_path, events=('start', 'end'))
>     try:
>         _, root = next(doc)
>     except StopIteration:
>         return
>     start_tag = None
>     for event, element in doc:
>         if event == 'start' and start_tag is None:
>             start_tag = element.tag
>         if event == 'end' and element.tag == start_tag:
>             for record_node in record_xpath(root):
>                 yield [xp(record_node) for xp in column_xpaths]
>             start_tag = None
>             root.clear()
Well, if you need all elements, then don't throw them away. You are clearing the root node, with whatever content it has up to that point. But you may still need that content.
Read the warning in the docs:
https://lxml.de/parsing.html#modifying-the-tree
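Concretely, here's a quick probe (a sketch of mine, run against the test file from your script; it deliberately does no cleanup, so it keeps the whole tree in memory): by the time the 'end' event for one c0 is delivered, the parser may already have attached a partially built next c0 to the root, and that is exactly what your root.clear() destroys:

    from lxml import etree as ET

    # Watch for records whose 'end' event fires while the root already
    # holds a newer (possibly half-parsed) sibling behind them.
    for event, c0 in ET.iterparse('test-incr.xml', events=('end',), tag='c0'):
        root = c0.getparent()
        if root.index(c0) != len(root) - 1:
            print('at the end event of record', c0[0][0].text,
                  'the root already holds', len(root), 'children')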
Stefan

On Fri, Jan 25, 2019 at 6:15 PM Stefan Behnel <stefan_ml@behnel.de> wrote:
> > [...]
> Well, if you need all elements, then don't throw them away. You are clearing the root node, with whatever content it has up to that point. But you may still need that content.
>
> Read the warning in the docs:
OK, thanks for that pointer. I'll have to refine the code.
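Perhaps something along these lines (an untested sketch; note it matches records by tag name via iterparse's `tag` filter instead of the original record xpath, which is a simplification on my part):

    from lxml import etree as ET

    def iter_xml_records(file_path, record_tag, column_xpaths):
        # Only react to 'end' events of complete records; the subtree
        # of `record` is guaranteed to be fully built at that point.
        for _, record in ET.iterparse(file_path, events=('end',), tag=record_tag):
            yield [xp(record) for xp in column_xpaths]
            # This record and all of its preceding siblings are
            # finished, so they can be released without touching
            # anything still under construction.
            record.clear()
            while record.getprevious() is not None:
                del record.getparent()[0]

    # e.g. with the test file from the original script:
    # iter_xml_records('test-incr.xml', 'c0', [ET.XPath('./c1/c2/text()')])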
Wietse
participants (2)
- Stefan Behnel
- Wietse Jacobs