Hello everyone,
I've been racking my brain over a performance issue I'm having, and I could use some help.
At my work we have to parse extremely large XML files - 20GB and even larger. The basic algorithm is as follows:
from lxml import etree  # or: import xml.etree.ElementTree as etree

with open(file, "rb") as reader:
    context = etree.iterparse(reader, events=('start', 'end'))
    for ev, el in context:
        # ... per-element processing ...
        el.clear()
In Python 2.7, the processing time for a 20GB XML file is approximately 40 minutes.
In Python 3.13, it takes about 7 hours, more than ten times slower than Python 2.
We went over the port with a fine-toothed comb without finding the cause (there were minimal changes in the porting process). Out of desperation I commented out the el.clear() line, and that turned out to be the culprit: without it, Python 3 performance matches Python 2.
Unfortunately, when we tested this on a server with less RAM, the program ran out of memory and crashed (it worked fine under Python 2).
I tried replacing el.clear() with del el instead, but it did not help - apparently something still held references to the element, so the garbage collector never freed it, as sketched below.
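Roughly, this is what the del attempt looked like (a minimal sketch; process() is a hypothetical stand-in for our real processing):

for ev, el in context:
    if ev == 'end':
        process(el)  # hypothetical stand-in for our actual processing
        del el       # only unbinds the local name; the parent element
                     # (and the tree growing under the root) still
                     # references el, so no memory is released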
Questions:
1. What is the difference between Python 2's and Python 3's implementations of clear()?
2. Is there a way to avoid this performance penalty? I have tried fast_iter (sketched below), clearing the root element, and re-assigning the element to None; nothing works.
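For reference, this is the fast_iter pattern I tried (a sketch of the commonly circulated recipe; it assumes lxml, since getprevious() and getparent() are lxml-specific):

def fast_iter(context, func):
    # context is an iterparse iterator over 'end' events
    for event, elem in context:
        func(elem)
        elem.clear()  # drop the element's children and text
        # also drop already-processed preceding siblings, so the
        # partial tree under the root does not keep growing
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

fast_iter(etree.iterparse(reader, events=('end',)), process)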
Any help would be greatly appreciated.
Regards,