Incremental parsing of XML

Hello, I've run up against a problem and I've tried to create a "minimal" example to demonstrate.

What I'm trying to achieve is to incrementally parse xml files but still use xpath expressions to extract "records" from the xml. A user would provide an xml file, a "record xpath" and a set of "column xpaths". My plan was to read the xml file incrementally by only reading "a" first child node of the root, extracting records from that node, discarding the node and continuing with the next "first child". This works fine for most of the xml files we're importing, except when they get really big. Of course, the whole point of this exercise is to be able to parse *big* files without running out of memory.

The problem I'm seeing is that *some* nodes aren't loaded correctly (they are "empty"), but I can't figure out why.

python 3.6.5
lxml 4.3.0.0
libxml 2.9.9
libxslt 1.1.32

If I run the following script:

------8<-----------------------------------------------------
from lxml import etree as ET


def create_big_xml(file_path, node_count=1000):
    # Incrementally write node_count <c0><c1><c2>i</c2></c1></c0>
    # records under a single <root> element.
    with ET.xmlfile(file_path, encoding='utf-8') as fh:
        fh.write_declaration()
        with fh.element('root'):
            for i in range(node_count):
                c0 = ET.Element('c0')
                c1 = ET.SubElement(c0, 'c1')
                c2 = ET.SubElement(c1, 'c2')
                c2.text = str(i)
                fh.write(c0)


def iter_xml_records(file_path, record_xpath, column_xpaths):
    doc = ET.iterparse(file_path, events=('start', 'end'))
    try:
        # The first event is the 'start' of the root element.
        _, root = next(doc)
    except StopIteration:
        return
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            # Remember the tag of the current "first child" of the root.
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            # The first child is complete: extract its records,
            # then discard it and move on to the next first child.
            for record_node in record_xpath(root):
                yield [xp(record_node) for xp in column_xpaths]
            start_tag = None
            root.clear()


fp = 'test-incr.xml'
errors = 0
node_count = 100
# Grow the file tenfold per round until extraction starts failing.
while not errors:
    node_count *= 10
    create_big_xml(fp, node_count)
    rec_xp = ET.XPath('./c0')
    col_xps = [
        ET.XPath('./c1/c2/text()'),
    ]
    for i, record in enumerate(iter_xml_records(fp, rec_xp, col_xps)):
        if record != [[str(i)]]:
            errors += 1
    print(node_count, "nodes:", errors, "errors")
------8<-----------------------------------------------------

The output on my machine is:
If I comment out the line "fh.write_declaration()", it takes longer to produce errors:
If you're still with me, thanks for taking the time! Is this a bug in lxml, or am I doing something that I shouldn't?

Regards,
Wietse
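
As a reference for the mechanics the script above relies on, here is a minimal, self-contained sketch of the event order iterparse delivers; the inline one-record document is made up for illustration:

------8<-----------------------------------------------------
from io import BytesIO

from lxml import etree as ET

# Tiny stand-in for the generated file: a single <c0> record under <root>.
doc = b'<root><c0><c1><c2>0</c2></c1></c0></root>'

for event, element in ET.iterparse(BytesIO(doc), events=('start', 'end')):
    print(event, element.tag)

# Prints:
#   start root
#   start c0
#   start c1
#   start c2
#   end c2
#   end c1
#   end c0
#   end root
------8<-----------------------------------------------------

iter_xml_records consumes the first 'start' event to obtain the root, notes the tag of the next 'start' event (the first child, 'c0' here), and runs its XPaths against the root once the matching 'end' event arrives.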

Wietse Jacobs wrote on 25.01.19 at 17:44:
Well, if you need all elements, then don't throw them away. You are clearing the root node, with whatever content it has up to that point. But you may still need that content.

Read the warning in the docs: https://lxml.de/parsing.html#modifying-the-tree

Stefan
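
A minimal sketch of the pattern that advice points at, assuming the records are the direct <c0> children of the root as in the script above (the tag-filter interface here is a simplification of the original XPath-based one):

------8<-----------------------------------------------------
from lxml import etree as ET


def iter_xml_records(file_path, record_tag, column_xpaths):
    # Only ask for the 'end' events of the record tag itself.
    for _, element in ET.iterparse(file_path, events=('end',),
                                   tag=record_tag):
        yield [xp(element) for xp in column_xpaths]
        # Free the record we just finished processing...
        element.clear()
        # ...and drop already-processed siblings still attached to the
        # root, leaving any partially parsed content alone.
        while element.getprevious() is not None:
            del element.getparent()[0]
------8<-----------------------------------------------------

With the example file above this could be driven as iter_xml_records('test-incr.xml', 'c0', [ET.XPath('./c1/c2/text()')]); the difference from the original is that only fully parsed <c0> elements are discarded, never the whole root.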
