Hello everyone,

I've been racking my brain over a performance issue and could use some help.

At work we have to parse extremely large XML files - 20GB and larger. The basic algorithm is as follows:

from lxml import etree

with open(file, "rb") as reader:
    context = etree.iterparse(reader, events=('start', 'end'))
    for ev, el in context:
        # (processing)
        el.clear()

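For anyone who wants to poke at the loop shape without a 20GB file, here is a tiny self-contained version. Note that it uses the stdlib ElementTree rather than lxml, so it won't reproduce our timings - it only shows the structure of the loop:

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real 20GB file; production code uses lxml,
# but the loop shape is the same with the stdlib parser.
data = io.BytesIO(b"<root><rec>a</rec><rec>b</rec></root>")

texts = []
for ev, el in ET.iterparse(data, events=("start", "end")):
    if ev == "end" and el.tag == "rec":
        texts.append(el.text)  # stands in for the real processing
        el.clear()             # the call whose cost is at issue
```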
In Python 2.7, the processing time for a 20GB XML file is approximately 40 minutes.

In Python 3.13, it's 7 hours - more than ten times slower than under Python 2.

We went over the port with a fine-toothed comb to find the reason (the changes made during porting were minimal), and out of desperation I commented out the el.clear() line - that turned out to be the culprit: without it, Python 3 performance matches Python 2.

Unfortunately, when we tested this on a less well-endowed server, the program crashed with an out-of-memory error (it ran fine there under Python 2).

I tried substituting del el for el.clear(), but it didn't help - apparently references to the elements were still held elsewhere, so the memory was never reclaimed.

Questions:

1. What is the difference between the Python 2 and Python 3 implementations of clear()?

2. Is there a way to avoid this performance penalty? I've tried fast_iter, clearing the root element, and re-assigning the element to None - nothing works.
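For reference, the fast_iter-style cleanup I tried is essentially the following. I've sketched it here with the stdlib ElementTree instead of lxml (the usual recipe's getprevious()/getparent() calls are lxml-specific), approximating the sibling pruning with root.remove(); it assumes the records sit directly under the document root:

```python
import io
import xml.etree.ElementTree as ET

def fast_iter(source, tag, func):
    """Streamed parse: call func on each completed <tag> element, then
    clear it and prune it from the root so the in-memory tree never
    grows beyond the current record."""
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and root is None:
            root = elem  # the first start event is the document root
        elif event == "end" and elem.tag == tag:
            func(elem)
            elem.clear()
            if root is not None and root is not elem:
                root.remove(elem)  # assumes records are direct children of root

# Tiny stand-in for the real file.
seen = []
fast_iter(io.BytesIO(b"<items><item>a</item><item>b</item></items>"),
          "item", lambda el: seen.append(el.text))
```

The func callback runs before the element is cleared, so it can still read text and attributes; only after it returns is the element emptied and detached.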

Any help would be greatly appreciated.

Regards,