Incremental parsing of XML

Hello, I've run up against a problem and I've tried to create a "minimal" example to demonstrate.

What I'm trying to achieve is to incrementally parse xml files but still use xpath expressions to extract "records" from the xml. A user would provide an xml file, a "record xpath" and a set of "column xpaths". My plan was to read the xml file incrementally by only reading "a" first child node of the root, extracting records from that node, discarding the node and continuing with the next "first child". This works fine for most of the xml files we're importing, except when they get really big. Of course, the whole point of this exercise is to be able to parse *big* files without running out of memory.

The problem I'm seeing is that *some* nodes aren't loaded correctly (they are "empty"), but I can't figure out why.

python 3.6.5
lxml 4.3.0.0
libxml 2.9.9
libxslt 1.1.32

If I run the following script:

------8<-----------------------------------------------------
from lxml import etree as ET


def create_big_xml(file_path, node_count=1000):
    # Incrementally write node_count <c0><c1><c2>i</c2></c1></c0>
    # records under a single <root> element.
    with ET.xmlfile(file_path, encoding='utf-8') as fh:
        fh.write_declaration()
        with fh.element('root'):
            for i in range(node_count):
                c0 = ET.Element('c0')
                c1 = ET.SubElement(c0, 'c1')
                c2 = ET.SubElement(c1, 'c2')
                c2.text = str(i)
                fh.write(c0)


def iter_xml_records(file_path, record_xpath, column_xpaths):
    doc = ET.iterparse(file_path, events=('start', 'end'))
    try:
        # The first event is the 'start' of the root element.
        _, root = next(doc)
    except StopIteration:
        return
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            # Remember the tag of the current "first child" of the root.
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            # The first child is complete: extract its records,
            # then discard it and move on to the next first child.
            for record_node in record_xpath(root):
                yield [xp(record_node) for xp in column_xpaths]
            start_tag = None
            root.clear()


fp = 'test-incr.xml'
errors = 0
node_count = 100
# Grow the file tenfold per round until extraction starts failing.
while not errors:
    node_count *= 10
    create_big_xml(fp, node_count)
    rec_xp = ET.XPath('./c0')
    col_xps = [
        ET.XPath('./c1/c2/text()'),
    ]
    for i, record in enumerate(iter_xml_records(fp, rec_xp, col_xps)):
        if record != [[str(i)]]:
            errors += 1
    print(node_count, "nodes:", errors, "errors")
------8<-----------------------------------------------------

The output on my machine is:
If I comment out the line "fh.write_declaration()", it takes longer to produce errors:
If you're still with me, thanks for taking the time! Is this a bug in lxml, or am I doing something that I shouldn't?

Regards,
Wietse
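
As a reference for the mechanics the script above relies on, here is a minimal, self-contained sketch of the event order iterparse delivers; the inline one-record document is made up for illustration:

------8<-----------------------------------------------------
from io import BytesIO

from lxml import etree as ET

# Tiny stand-in for the generated file: a single <c0> record under <root>.
doc = b'<root><c0><c1><c2>0</c2></c1></c0></root>'

for event, element in ET.iterparse(BytesIO(doc), events=('start', 'end')):
    print(event, element.tag)

# Prints:
#   start root
#   start c0
#   start c1
#   start c2
#   end c2
#   end c1
#   end c0
#   end root
------8<-----------------------------------------------------

iter_xml_records consumes the first 'start' event to obtain the root, notes the tag of the next 'start' event (the first child, 'c0' here), and runs its XPaths against the root once the matching 'end' event arrives.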

Wietse Jacobs wrote on 25.01.19 at 17:44:
Well, if you need all elements, then don't throw them away. You are clearing the root node, with whatever content it has up to that point. But you may still need that content.

Read the warning in the docs: https://lxml.de/parsing.html#modifying-the-tree

Stefan
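
A minimal sketch of the pattern that advice points at, assuming the records are the direct <c0> children of the root as in the script above (the tag-filter interface here is a simplification of the original XPath-based one):

------8<-----------------------------------------------------
from lxml import etree as ET


def iter_xml_records(file_path, record_tag, column_xpaths):
    # Only ask for the 'end' events of the record tag itself.
    for _, element in ET.iterparse(file_path, events=('end',),
                                   tag=record_tag):
        yield [xp(element) for xp in column_xpaths]
        # Free the record we just finished processing...
        element.clear()
        # ...and drop already-processed siblings still attached to the
        # root, leaving any partially parsed content alone.
        while element.getprevious() is not None:
            del element.getparent()[0]
------8<-----------------------------------------------------

With the example file above this could be driven as iter_xml_records('test-incr.xml', 'c0', [ET.XPath('./c1/c2/text()')]); the difference from the original is that only fully parsed <c0> elements are discarded, never the whole root.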
