
D.H.J. Takken wrote on 24.11.2014 at 09:51:
On 11/21/2014 08:31 PM, Charlie Clark wrote:
As noted elsewhere, you can pass in a list of tags to do this. However, when running benchmarks in openpyxl, we discovered that for pure parsing xml.etree.cElementTree can be *significantly* faster than lxml: 2 to 3 times in our experience. I discussed this with Stefan and he said it's largely down to the different C libraries: you pay a penalty for the richer interface of libxml2.
Is cET still faster when every single tag is yielded by the iterator? The cET iterparse implementation does not appear to feature tag filtering, so I need to use Python if statements to do the filtering myself. I can imagine that this pretty much defeats the performance advantage...
You should try it. It can make a difference, as it reduces the amount of interpreted code being executed. Behind the scenes, however, lxml would still instantiate the Element objects (if only for tree memory management), so don't expect wonders. It really depends a lot on your code and the task at hand, including the work you do during each iteration step for the actual tree processing.

Selective iterparsing is one of the specific cases where optimising lxml's iterparse implementation could show a lot of improvement, though, as it could then reduce the number of instantiated Element objects to what is really requested on the user side.

Stefan
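For anyone who wants to try this, here is a minimal sketch of the comparison discussed above: manual tag filtering in Python on top of the standard library's iterparse, next to lxml's iterparse with its `tag` argument. The tag names (`row`, `cell`, `note`) and the helper names (`build_sample`, `count_stdlib`, `count_lxml`) are made up for illustration. The lxml part assumes lxml 3.0 or later is installed and is skipped otherwise. On current Python versions, `xml.etree.ElementTree` uses the C accelerator automatically when it is available, so it stands in for cElementTree here.

```python
import io
import timeit
import xml.etree.ElementTree as ET  # C-accelerated when available (Python 3.3+)

try:
    from lxml import etree as LET  # optional; assumed to be lxml >= 3.0
except ImportError:
    LET = None

# Hypothetical tag names, chosen only for this sketch.
WANTED = ("row", "cell")


def build_sample(n_rows=1000):
    """Build a synthetic document with wanted and unwanted tags."""
    rows = "".join(
        "<row><cell>%d</cell><note>skip me</note></row>" % i
        for i in range(n_rows)
    )
    return ("<sheet>%s</sheet>" % rows).encode("utf-8")


def count_stdlib(data, wanted=WANTED):
    """Every element is yielded; the filtering happens in Python."""
    wanted = set(wanted)
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag in wanted:
            count += 1
        elem.clear()  # drop content we no longer need
    return count


def count_lxml(data, wanted=WANTED):
    """Only elements with a matching tag are yielded by the iterator."""
    count = 0
    for _event, elem in LET.iterparse(
        io.BytesIO(data), events=("end",), tag=wanted
    ):
        count += 1
        elem.clear()
    return count


if __name__ == "__main__":
    data = build_sample()
    print("stdlib:", timeit.timeit(lambda: count_stdlib(data), number=5))
    if LET is not None:
        print("lxml:  ", timeit.timeit(lambda: count_lxml(data), number=5))
```

The numbers will depend heavily on the document and on the work done per element, so this is only a starting point for measuring with your own data.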