Hi everyone,
Thank you for your replies.
> I guess this is not a look-alike example but just meant as a hint, right?
Yes. My workplace is very protective of its source code, so I am only allowed to sketch a rough approximation.
The start events are used during some of the processing, and as you mentioned, element.clear() is not invoked for them.
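To make the hint a bit more concrete, the loop is shaped roughly like this; handle() and the "record" tag are placeholders, not our real names:

    from lxml import etree

    def process(path):
        for event, elem in etree.iterparse(path, events=("start", "end")):
            if event == "start":
                # start events feed some of the bookkeeping; the element
                # is not fully built yet, so clear() must not run here
                continue
            if elem.tag == "record":       # placeholder tag
                handle(elem)               # placeholder per-record work
                elem.clear()               # the call we later commented out
                # lxml-only: detach already-processed siblings so the
                # root does not keep the whole tree alive
                while elem.getprevious() is not None:
                    del elem.getparent()[0]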
> Are you using the same versions of lxml (and libxml2) in both?
No, and that's what makes it so frustrating. I cannot tell management that moving to the latest versions of Python and lxml actually causes a significant performance penalty. By rights the latest versions should be at least as good as, if not better than, the older ones.
> Does the memory consumption stay constant over time or does it continuously
> grow as it parses?
It grows continuously until the process eventually crashes. My colleague expanded the page file, which delayed the crash, but it still happened in the end.
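For anyone who wants to watch the growth, here is a rough sketch of how it could be observed from inside the process ("big.xml" is a placeholder, and psutil is a third-party package). Watching the process RSS matters here because most of lxml's memory lives in libxml2's C allocations, which a Python-level tracer such as tracemalloc never sees:

    import psutil                      # third-party: pip install psutil
    from lxml import etree

    proc = psutil.Process()            # the current process
    events = etree.iterparse("big.xml", events=("end",))
    for i, (event, elem) in enumerate(events):
        elem.clear()
        if i % 100000 == 0:
            rss = proc.memory_info().rss / 1e6
            print("after %d end events: rss = %.1f MB" % (i, rss))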
> Have you run a memory profiler on your code? Or a (statistical) line profiler to see where the time is spent?
I used Python's cProfile to find the bottlenecks, but unfortunately the results didn't make sense: it identified which functions were taking the most time, but when I did a line-by-line analysis the times didn't add up.
Since commenting out the element.clear() lines brought the runtime close to Python 2.7's, the rest of the team concluded that this is where the issue lies.
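For anyone who wants to reproduce the measurement, a minimal cProfile run looks like this (process() is the placeholder entry point from the sketch above):

    import cProfile
    import pstats

    # dump a function-level profile, then print the top 20 entries
    cProfile.run("process('big.xml')", "parse.prof")
    stats = pstats.Stats("parse.prof")
    stats.sort_stats("cumulative").print_stats(20)

One possible reason the line-level numbers didn't add up: cProfile treats calls into C code, such as libxml2's parser, as opaque units, so their time is charged to whichever Python frame made the call.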
> the standard library's etree module is often significantly faster,
I had not considered that angle, since everything I could find on Google indicated that lxml is the fastest, but I'll give it a try.
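If anyone wants to try the same comparison, the switch is mostly an import change for this pattern (handle() and "record" are the same placeholders as before):

    import xml.etree.ElementTree as etree

    for event, elem in etree.iterparse("big.xml", events=("end",)):
        if elem.tag == "record":
            handle(elem)
            elem.clear()
    # the stdlib Element has no getparent()/getprevious(), so the
    # lxml-only sibling cleanup from the earlier sketch is dropped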