Re: [lxml] Efficient incremental parsing using etree.iterparse

Nov. 24, 2014

D.H.J. Takken wrote on 24.11.2014 at 09:51:
...
On 11/21/2014 08:31 PM, Charlie Clark wrote:
...
As noted elsewhere you can pass in a list of tags to do this. However,
when running benchmarks in openpyxl we discovered that for pure parsing
xml.etree.cElementTree can be *significantly* faster than lxml: 2 to 3
times in our experience. I discussed this with Stefan and he said it's
largely down to the different C libraries: you pay a penalty for the
richer interface of libxml2.
Is cET still faster when every single tag is yielded by the iterator?
The cET iterparse implementation does not appear to feature tag
filtering, so I need to use Python if statements to do the filtering
myself. I can imagine that this pretty much defeats the performance
advantage...
You should try it. It can make a difference, as it reduces the amount of
interpreted code being executed, but behind the scenes lxml would still
instantiate the Element objects (if only for tree memory management), so
don't expect wonders. It really depends a lot on your code and the task at
hand, including the work you do during each iteration step for actual tree
processing.
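
To make the comparison concrete, here is a minimal sketch of the two
approaches discussed above: filtering by hand in the loop with the standard
library parser, versus passing lxml's `tag` argument to iterparse. The
function names and the sample document are made up for illustration, and
xml.etree.ElementTree stands in for cElementTree (it uses the C accelerator
in current CPython). Wrap each call in your own timing code with a
realistic input to see which one wins for your task.

```python
import io
import xml.etree.ElementTree as ET  # stands in for cElementTree


def count_manual_filter(source, wanted_tag):
    """Filter in Python: every 'end' event reaches the loop body."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == wanted_tag:
            count += 1
            elem.clear()  # drop text and children we no longer need
    return count


def count_lxml_filter(source, wanted_tag):
    """Let lxml filter: only matching elements reach the loop body."""
    from lxml import etree  # third-party, imported lazily

    count = 0
    for _event, elem in etree.iterparse(source, events=("end",), tag=wanted_tag):
        count += 1
        elem.clear()
    return count


# Hypothetical sample input, just to exercise both functions.
SAMPLE = b"<root><item>1</item><other/><item>2</item></root>"

if __name__ == "__main__":
    print(count_manual_filter(io.BytesIO(SAMPLE), "item"))
```

Both functions do the same trivial work per matching element, so any
difference you measure comes from the parser and from the number of events
that reach Python code.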

Selective iterparsing is one of the specific cases where optimising lxml's
iterparse implementation could show a lot of improvement, though, as it
could then reduce the number of instantiated Element objects to what is
actually requested by the user.

Stefan

Stefan Behnel