
Hello everyone,

I've been racking my brain over a performance issue and could use some help. At work we have to parse extremely large XML files, 20 GB and larger. The basic algorithm is as follows:

```python
with open(file, "rb") as reader:
    context = etree.iterparse(reader, events=('start', 'end'))
    for ev, el in context:
        # ... processing ...
        el.clear()
```

In Python 2.7, processing a 20 GB XML file takes approximately 40 minutes. In Python 3.13, it takes 7 hours, more than ten times as long. We went through the code with a fine-toothed comb looking for the reason (there were minimal changes in the porting process) and found nothing. Out of desperation I commented out the `el.clear()` line, and that turned out to be the cause: without it, Python 3 performance matches Python 2.

Unfortunately, when we tested this on a less well-endowed server, the program crashed after running out of memory (it had worked fine with Python 2). I tried substituting `del el` for `el.clear()`, but it did not help; apparently references to the elements were still held somewhere, so the memory was never reclaimed.

Questions:

1. What is the difference between the Python 2 and Python 3 implementations of `clear()`?
2. Is there a way to avoid this performance penalty? I tried `fast_iter`, clearing the root element, and re-assigning the element to `None`; nothing works.

Any help would be greatly appreciated.

Regards,
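For completeness, this is roughly the "clearing the root element" variant I tried, sketched here with the standard library's `ElementTree` (the function name `stream_records` and the `item` tag are illustrative, not from our codebase): the idea is to grab the root from the first `start` event and clear it after each processed record, so the root stops holding references to already-processed children.

```python
import io
import xml.etree.ElementTree as ET

def stream_records(source, tag):
    """Yield the text of each <tag> element, clearing the root as we go
    so the partially built tree does not accumulate in memory."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)              # the first event is ('start', root)
    for event, el in context:
        if event == 'end' and el.tag == tag:
            yield el.text
            root.clear()                 # drop references to processed children

# Example over an in-memory file standing in for the real 20 GB input:
data = b"<root><item>a</item><item>b</item><item>c</item></root>"
print(list(stream_records(io.BytesIO(data), "item")))
```

On the small server this kept memory flat, but in our case the per-record clearing still carried the same slowdown as `el.clear()` under Python 3.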