Hello everyone,
I've been racking my brain over a performance issue and could use some help.
At my work we have to parse extremely large XML files - 20GB and even
larger. The basic algorithm is as follows:
from lxml import etree

with open(file, "rb") as reader:
    context = etree.iterparse(reader, events=('start', 'end'))
    for ev, el in context:
        # (processing)
        el.clear()
In Python 2.7, the processing time for a 20GB XML file is approximately 40
minutes.
In Python 3.13, it's 7 hours - more than ten times slower than Python 2.
We went through the code with a fine-tooth comb to find the reason (there
were minimal changes in the porting process). Out of desperation I
commented out the el.clear() line, and apparently that is the cause -
without it, performance in Python 3 matches Python 2.
Unfortunately, when we tested this on a server with less memory, the
program crashed with an out-of-memory error (it worked fine under Python
2).
I tried substituting del el for el.clear(), but it didn't help -
apparently the tree still holds references to the element, so the memory
was never reclaimed.
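To illustrate what I mean (a minimal, self-contained example - the element names here are made up, not from our real data):

```python
from lxml import etree

root = etree.fromstring(b"<r><a/><b/></r>")
first = root[0]
del first              # removes only the local name binding
# The element is still attached to (and referenced by) its parent
# tree, so nothing can actually be garbage-collected:
print(len(root))       # -> 2
```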
Questions:
1. What is the difference between the Python 2 and Python 3 builds'
implementation of clear()?
2. Is there a way to avoid this performance penalty? I tried fast_iter,
clearing the root element, and re-assigning the element to None - nothing
works.
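For reference, the fast_iter variant I tried is roughly the well-known recipe (simplified here to run on a tiny in-memory document; the tag name and callback are just placeholders, not our real processing):

```python
from io import BytesIO
from lxml import etree

def fast_iter(context, func):
    # Consume iterparse events, process each element, then free both
    # the element and any already-processed preceding siblings that
    # the parent still references.
    for _event, elem in context:
        func(elem)
        elem.clear()
        # Drop references kept by the parent so memory can be reclaimed.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

# Illustrative usage on a tiny in-memory document:
xml = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(5)) + b"</root>"
seen = []
context = etree.iterparse(BytesIO(xml), events=("end",), tag="item")
fast_iter(context, lambda el: seen.append(el.text))
print(seen)  # -> ['0', '1', '2', '3', '4']
```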
Any help would be greatly appreciated.
Regards,