Hello everyone,

I've been cracking my head over a performance issue and could use some help. At work we have to parse extremely large XML files, 20 GB and larger. The basic algorithm is as follows:

```python
with open(file, "rb") as reader:
    context = etree.iterparse(reader, events=('start', 'end'))
    for ev, el in context:
        # (processing)
        el.clear()
```

In Python 2.7, the processing time for a 20 GB XML file is approximately 40 minutes. In Python 3.13 it is 7 hours, more than ten times slower than under Python 2. We went through the code with a fine-tooth comb to find the reason (there were minimal changes in the porting process), and out of desperation I commented out the el.clear() line. Apparently that is the cause: without it, Python 3's performance matches Python 2's. Unfortunately, when we tested this on a less well-endowed server, the program crashed after running out of memory (it worked fine with Python 2). I tried substituting el.clear() with del el, but that did not help; apparently there were still references somewhere, so the garbage collector never fired.

Questions:

1. What is the difference between the Python 2 and Python 3 implementations of clear()?
2. Is there a way to avoid this performance penalty? I have tried fast_iter, clearing the root element and re-assigning the element to None; nothing works.

Any help would be greatly appreciated.

Regards,
Hi, Noorulamry Daud wrote on 13.02.25 at 12:28:
I've been cracking my head about this performance issue I'm having and I could use some help.
At my work we have to parse extremely large XML files - 20GB and even larger. The basic algorithm is as follows:
with open(file, "rb") as reader: context = etree.iterparse(reader, events=('start', 'end')) for ev, el in context: (processing) el.clear()
I guess this is not an exact copy of your code but just meant as a sketch, right? Clearing the elements on both start and end events seems useless, and clearing them on start is probably outright dangerous. You would at least want to pass the "keep_tail=True" option and clear elements only on the end event.

https://lxml.de/parsing.html#modifying-the-tree
https://lxml.de/parsing.html#incremental-event-parsing
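For illustration, a minimal sketch of that pattern (not the poster's actual code; the 'record' tag and the processing placeholder are hypothetical, and clear(keep_tail=True) requires a reasonably recent lxml):

```python
from lxml import etree

def parse_records(path):
    # Ask only for 'end' events; 'start' events are rarely needed for cleanup.
    context = etree.iterparse(path, events=('end',), tag='record')
    for event, el in context:
        ...  # process el here
        # Clear the element's children and text but keep its tail text,
        # so the surrounding tree stays intact.
        el.clear(keep_tail=True)
        # Also detach already-processed siblings from the parent/root,
        # otherwise the partially built tree keeps growing.
        while el.getprevious() is not None:
            del el.getparent()[0]
    del context
```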
In Python 2.7, the processing time for a 20GB XML file is approximately 40 minutes.
In Python 3.13, it's 7 hours, more than ten times from Python 2.
Are you using the same versions of lxml (and libxml2) in both? There shouldn't be a difference in behaviour, except for the obvious language differences (bytes/unicode).

Does the memory consumption stay constant over time, or does it continuously grow as it parses? Have you run a memory profiler on your code? Or a (statistical) line profiler to see where the time is spent?

Stefan
On 13 Feb 2025, at 15:18, Stefan Behnel via lxml - The Python XML Toolkit wrote:
Are you using the same versions of lxml (and libxml2) in both?
There shouldn't be a difference in behaviour, except for the obvious language differences (bytes/unicode).
Based on the parsing code we use in Openpyxl, I'd agree with this. NB: we discovered that, for pure parsing (i.e. you just want to get at the data), the standard library's etree module is often significantly faster, but YMMV.
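For comparison, here is roughly what the same incremental pattern looks like with the standard library; just a sketch, with 'record' as a placeholder tag (xml.etree has no keep_tail argument, so the usual trick is to clear the root after each processed element):

```python
import xml.etree.ElementTree as ET

def parse_records_stdlib(path):
    context = ET.iterparse(path, events=('start', 'end'))
    event, root = next(context)                    # the first event is ('start', root)
    for event, el in context:
        if event == 'end' and el.tag == 'record':  # 'record' is a placeholder
            ...  # process el here
            # Clearing the root drops all children parsed so far (including el),
            # which keeps memory flat; note it also wipes the root's own
            # attributes and text, which is usually acceptable for this purpose.
            root.clear()
```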
Does the memory consumption stay constant over time or does it continuously grow as it parses?
Have you run a memory profiler on your code? Or a (statistical) line profiler to see where the time is spent?
Excellent suggestions: memory_profiler and pympler are useful tools for this.

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
D-40489 Düsseldorf
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226
Hi everyone,

Thank you for your replies.
I guess this is not a look-alike example but just meant as a hint, right?
Yes. My employer is very protective of the source code, so I am only allowed to sketch a rough approximation. The start events are used in some of the processing, and as you suspected, element.clear() is not invoked on them.
Are you using the same versions of lxml (and libxml2) in both?
No, and that's what makes it so frustrating. I cannot tell management that using the latest versions of Python and lxml actually causes a significant performance penalty. By rights the latest versions should be at least as good as, if not better than, the older ones.
Does the memory consumption stay constant over time or does it continuously grow as it parses?
It grows larger until it eventually crashes. My colleague expanded the page file and managed to delay the crash, but it happened eventually.

Have you run a memory profiler on your code? Or a (statistical) line profiler to see where the time is spent?

I used Python's cProfile to find the bottlenecks, but unfortunately the results didn't make sense. It identified which functions were taking the most time, but when I did a line-by-line analysis the times didn't add up. Since commenting out the element.clear() lines did bring the results close to Python 2.7's performance, the rest of the team decided that this is where the issue lies.
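As an aside, here is a rough way to see which objects actually accumulate during the parse, using the tools Charlie mentioned (memory_profiler and pympler). The file name and the 10,000-element sampling interval are illustrative, and the heap summary is itself slow, so this is only a diagnostic sketch:

```python
from lxml import etree
from memory_profiler import memory_usage
from pympler import muppy, summary

def diagnose(path):
    context = etree.iterparse(path, events=('end',))
    for n, (event, el) in enumerate(context):
        el.clear()
        if n % 10000 == 0:
            # Memory (in MiB) of the current process.
            print(n, memory_usage(proc=-1, interval=1)[0])
            # Which Python object types currently dominate the heap.
            summary.print_(summary.summarize(muppy.get_objects()))

diagnose("big.xml")  # hypothetical file name
```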
the standard library's etree module is often significantly faster,
I have not considered that angle, since what I could find on Google indicated that lxml is the fastest; but I'll give it a try.
Hi, Noorulamry Daud wrote on 14.02.25 at 09:56:
Are you using the same versions of lxml (and libxml2) in both?
No, and that's what makes it so frustrating. I cannot tell management that using the latest versions of Python and lxml actually causes a significant performance penalty. By rights the latest versions should be at least as good as, if not better than, the older ones.
It should be. This seems to be more of a memory problem.
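Either way, it is worth confirming exactly which versions each environment actually runs; lxml exposes version constants for this (a quick check, nothing more):

```python
import sys
from lxml import etree

print("Python:", sys.version_info)
print("lxml:", etree.LXML_VERSION)
print("libxml2 used at runtime:", etree.LIBXML_VERSION)
print("libxml2 compiled against:", etree.LIBXML_COMPILED_VERSION)
```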
Does the memory consumption stay constant over time or does it continuously grow as it parses?
It grows larger until it eventually crashes. My colleague expanded the page file and managed to delay the crash, but it happened eventually.
Then you're not cleaning up enough of the XML tree. Some of it remains in memory after processing and thus leads to swapping and long waiting times. Try to find out what the tree looks like after a few iterations. You're collecting "start" events, so grab the first returned element (that's the root element) and print its tostring() after each .clear() call. That should show you what data you're missing in the cleanup.
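A rough illustration of that debugging idea (hypothetical file name; limit the iterations and truncate the output, because the dump gets verbose quickly):

```python
import itertools
from lxml import etree

context = etree.iterparse("big.xml", events=('start', 'end'))
root = None
for event, el in itertools.islice(context, 500):  # only inspect the first few hundred events
    if root is None:
        root = el                                  # the first 'start' event delivers the root
    if event == 'end':
        el.clear(keep_tail=True)
        # Whatever still hangs off the root after the cleanup is what
        # keeps accumulating in memory.
        print(etree.tostring(root, pretty_print=True)[:500].decode(errors='replace'))
```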
Have you run a memory profiler on your code? Or a (statistical) line profiler to see where the time is spent
I used Python's cProfile to find the bottlenecks, but unfortunately the results didn't make sense. It identified which functions were taking the most time, but when I did a line-by-line analysis the times didn't add up.
That's not unusual. Line profiling takes additional time *per line*, so the results are often different from simple *per function* timings. Statistical profilers are much better than cProfile here since they add less overhead.
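For example, with a sampling profiler like pyinstrument (assuming it is installed; py-spy is another option that can attach to a running process without code changes). The run_the_parse() function below is a hypothetical stand-in for whatever drives the iterparse loop:

```python
from pyinstrument import Profiler

def run_the_parse():
    # Hypothetical stand-in for the real iterparse-based processing loop.
    return sum(i * i for i in range(1_000_000))

profiler = Profiler()
profiler.start()
run_the_parse()
profiler.stop()

# Print a call tree weighted by sampled wall-clock time.
print(profiler.output_text(unicode=True, color=True))
```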
Since commenting out the element.clear() lines did bring the results close to Python 2.7's performance, the rest of the team decided that this is where the issue lies.
Sort of, but probably for other reasons.
the standard library's etree module is often significantly faster,
I have not considered that angle, since what I could find on Google indicated that lxml is the fastest; but I'll give it a try.
I can second that. It uses a different parser and in-memory model, so the performance is different – better for some things, worse for others. Try it to see where your code ends up. Note that the feature set is also very different, though; lxml adds lots of functionality that "xml.etree" cannot provide.

Stefan
On 14 Feb 2025, at 11:12, Stefan Behnel via lxml - The Python XML Toolkit wrote:
Then you're not cleaning up enough of the XML tree. Some of it remains in memory after processing it, and thus leads to swapping and long waiting times.
It's definitely a memory issue. You can write some code to catch memory use quickly. This is something we wrote for openpyxl while we were trying to "contain" memory use:

```python
import os

import openpyxl
from memory_profiler import memory_usage


def test_memory_use():
    """Naive test that assumes memory use will never be more than 120 %
    of that for the first 50 rows"""
    folder = os.path.split(__file__)[0]
    src = os.path.join(folder, "files", "very_large.xlsx")

    wb = openpyxl.load_workbook(src, read_only=True)
    ws = wb.active
    initial_use = None

    for n, line in enumerate(ws.iter_rows(values_only=True)):
        if n % 50 == 0:
            use = memory_usage(proc=-1, interval=1)[0]
            if initial_use is None:
                initial_use = use
            assert use / initial_use < 1.2
            print(n, use)


if __name__ == '__main__':
    test_memory_use()
```

You should be able to adapt this for your parser, and it'll tell you soon enough how far in you get before your memory use balloons. If memory serves, I had one problem where I was clearing in the wrong place, which meant that other elements were sticking around. Thanks to Stefan for helping me sort it. I think your code may be too aggressive. It might help to look at the Openpyxl worksheet parser, which has to handle what happens if you do additional processing within nodes.

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
D-40489 Düsseldorf
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226
participants (3)
- Charlie Clark
- Noorulamry Daud
- Stefan Behnel