Hi everyone,

Thank you for your replies.

> I guess this is not a look-alike example but just meant as a hint, right?

Yes. My workplace is very protective of the source code, so I am only allowed to sketch a rough approximation.

The start events are involved in some of the processing, and as you mentioned, element.clear() was not invoked on them.
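
To give a rough approximation of the pattern (placeholder file name, tag, and handler; not our real code):

    from lxml import etree

    def handle_record(element):
        # Placeholder for the real per-record processing.
        pass

    context = etree.iterparse("data.xml", events=("start", "end"))
    for event, element in context:
        if event == "start":
            # Start-event work happens here. element.clear() is never
            # called on "start", since the children are not parsed yet.
            continue
        if element.tag == "record":  # completed element on "end"
            handle_record(element)
            element.clear()
            # Drop the parent's references so cleared siblings can be
            # garbage-collected instead of accumulating.
            while element.getprevious() is not None:
                del element.getparent()[0]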

>  Are you using the same versions of lxml (and libxml2) in both?

No, and that's what makes it so frustrating. I cannot tell management that using the latest versions of Python and lxml actually incurs a significant performance penalty. By rights, the latest versions should perform at least as well as, if not better than, the older ones.

> Does the memory consumption stay constant over time or does it continuously grow as it parses?

It grows until the process eventually crashes. My colleague expanded the page file and managed to delay the crash, but it still happened eventually.
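
One caveat I'm aware of: Python's built-in tracemalloc only sees allocations made through Python's allocator, so if a tracemalloc comparison stays flat while the process keeps growing, that points at libxml2's C-side allocations (i.e. trees not being freed). A minimal sketch of that check (the chunked-parsing step is a placeholder):

    import tracemalloc

    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()

    # ... parse a chunk of the document here ...

    snapshot = tracemalloc.take_snapshot()
    # Show the ten biggest Python-level allocation deltas.
    for stat in snapshot.compare_to(baseline, "lineno")[:10]:
        print(stat)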

> Have you run a memory profiler on your code? Or a (statistical) line profiler to see where the time is spent

I used Python's cProfile to find the bottlenecks, but unfortunately the results didn't make sense. It identified which functions were taking the most time, but when I did a line-by-line analysis the times didn't add up.
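
(Part of the mismatch may simply be that cProfile only instruments Python-level function calls, so time spent inside lxml's C code is attributed wholesale to whichever Python function called into it, and never shows up line by line.) For completeness, the profiling was along these lines (simplified):

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()
    # ... run the parsing job here ...
    profiler.disable()

    # Sort by cumulative time to surface the expensive call paths.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)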

Since commenting out the element.clear() lines did bring the result close to Python 2.7's performance, the rest of the team decided that this is where the issue lies.

> the standard library's etree module is often significantly faster,

I had not considered that angle, since everything I could find on Google indicated that lxml is the fastest; I'll give it a try.
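
A rough sketch of what I plan to try with the standard library (assuming the records are direct children of the root; placeholder names again):

    import xml.etree.ElementTree as ET

    context = ET.iterparse("data.xml", events=("start", "end"))
    _, root = next(context)  # the first "start" event yields the root

    for event, element in context:
        if event == "end" and element.tag == "record":
            handle_record(element)  # placeholder handler
            # The stdlib has no getprevious()/getparent(), so clear the
            # root instead: this drops all children parsed so far
            # (along with the root's own text and attributes).
            root.clear()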



On Fri, 14 Feb 2025 at 00:21, Charlie Clark <charlie.clark@clark-consulting.eu> wrote:
On 13 Feb 2025, at 15:18, Stefan Behnel via lxml - The Python XML Toolkit wrote:


> Are you using the same versions of lxml (and libxml2) in both?
>
> There shouldn't be a difference in behaviour, except for the obvious language differences (bytes/unicode).

Based on the parsing code we use in Openpyxl, I'd agree with this. NB., we discovered that, for pure parsing, ie. you just want to get at the data, the standard library's etree module is often significantly faster, but YMMV.

> Does the memory consumption stay constant over time or does it continuously grow as it parses?
>
> Have you run a memory profiler on your code? Or a (statistical) line profiler to see where the time is spent

Excellent suggestions: memory_profiler and pympler are useful tools for this.

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D- 40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226