Re: [lxml-dev] lxml slower than ElementTree?

Hi,
John Machin, 24.11.2009 14:00:
Thanks for your response, Stefan.
You're welcome.
On 24/11/2009 7:48 PM, Stefan Behnel wrote:
In general, ET can't compete with lxml.etree, whereas cET can, especially when you stay with code that supports both ET and lxml.etree.
As it stands, the code tries to import xml.etree.cElementTree, then cElementTree, then ElementTree. I plan to allow an override option, where the caller can import any minimal-ET-subset-compliant module (e.g. lxml) and pass it in. The mentioned speed comparison was using cET.
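A minimal sketch of the import fallback described above, with the planned override. The function name `load_et` and its `override` parameter are hypothetical, and using `xml.etree.ElementTree` as the last candidate is an assumption about what "ElementTree" refers to here. On current Python versions the two cElementTree names no longer exist, so the loop falls through to the last candidate.

```python
import importlib

# Candidate module names, tried in order. The last entry is an assumption
# about which pure-Python ElementTree module is meant.
CANDIDATES = (
    "xml.etree.cElementTree",
    "cElementTree",
    "xml.etree.ElementTree",
)


def load_et(override=None):
    """Return an ElementTree-compatible module.

    `override` is a hypothetical parameter: any module exposing the
    minimal ET subset (for example lxml.etree), passed in by the caller.
    """
    if override is not None:
        return override
    for name in CANDIDATES:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module not available in this environment; try the next one.
            continue
    raise ImportError("no ElementTree implementation found")


ET = load_et()
root = ET.fromstring("<a><b/></a>")
```

A caller who has lxml installed could then pass `lxml.etree` as `override` without the rest of the code changing.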
- lxml parses much faster from file names than from open file(-like) objects, especially multi-threaded. ET handles them exactly the same.
The MS Excel 2007 file format is a ZIP file containing compressed XML documents. Hence one ends up parsing file-like objects.
Sure, I expected something like that. The thing is just that parsing from a file-like object in lxml means that it needs to walk all the way up the Python stack to look up and call the object's .read() method for each chunk of data, which then builds and returns a new Python string object. Parsing from a file path implies straight calls to C's file reading function, which works completely outside of the GIL. So there is a pretty huge performance penalty for file-like objects. This penalty applies to (c)ET as well, though, so that's not a reason for lxml to be slower than them.
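For illustration, here is roughly what the file-like case looks like when the XML lives inside a ZIP archive. This is only a sketch using the stdlib ElementTree; the archive is built in memory, and the member name and content are made up for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Build a tiny ZIP archive in memory so the sketch is self-contained.
# The member name and its content are illustrative only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr(
        "xl/worksheets/sheet1.xml",
        "<worksheet><row><c>1</c></row></worksheet>",
    )

with zipfile.ZipFile(buf) as zf:
    # zf.open() returns a file-like object. The parser pulls data from it
    # through repeated .read() calls, each returning a new bytes object,
    # which is the per-chunk overhead discussed above.
    with zf.open("xl/worksheets/sheet1.xml") as member:
        tree = ET.parse(member)

root = tree.getroot()
```

There is no file path to hand to the parser in this setup, so the faster path-based code route is not available.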
For better or worse, ET comes with Python. The design goal was to avoid requiring Nth-party modules where appropriate.
I totally understand that design goal. Being in the stdlib is clearly a major advantage of ET.
- ET creates the tree as Python objects once and for all, lxml creates Python proxies only on request.
I acknowledge that ET may start swapping much sooner than lxml.
That's not what I was referring to, though. It's ok to call ET an elephant, but cET is actually *very* memory friendly. I meant to say that lxml.etree implies a performance penalty if you access many elements in the tree, where cET can just return their reference. The penalty is not large, but it's certainly worth being a bit more selective when walking the tree and searching elements. That's what XPath and getiterator(tagname) are great for.
- Tree iteration using getiterator() in lxml.etree is much faster than in cET, and also much, much faster than using .find(). The difference is even more pronounced when searching for specific tags, because fewer proxies have to be created. (c)ET doesn't show a difference here.
A base class provides a controller method that uses getiterator(tag=None) and a mapping from tags to methods. Each type of document (about 5 types) has a subclass.
I would expect that to be pretty fast on cET. For lxml.etree, however, it might still be faster to traverse the document independently for each interesting tag name - as long as the document fits into memory completely, and as long as the methods are independent of each other. The benefit certainly depends on the ratio of interesting versus ignored elements in the traversal, but given that lxml's tight traversal loop is easily an order of magnitude faster than even cET, being selective can really make a difference here.
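The two traversal strategies can be sketched as follows. The handler names and the sample document are made up for illustration, and the code uses iter(), which is how current ElementTree and lxml versions spell getiterator(). It runs on the stdlib ElementTree, so it shows the shape of the code only, not the lxml speed difference.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, for illustration only.
doc = ET.fromstring(
    "<sheet><row><c>1</c><c>2</c></row><note>x</note><row><c>3</c></row></sheet>"
)


def handle_row(elem, out):
    # Record how many children each row has.
    out.append(("row", len(elem)))


def handle_c(elem, out):
    # Record the text of each cell.
    out.append(("c", elem.text))


# Mapping from tag names to handler functions (hypothetical names).
handlers = {"row": handle_row, "c": handle_c}


def dispatch_single_pass(root):
    """One traversal over all elements, looking up a handler per element."""
    out = []
    for elem in root.iter():
        handler = handlers.get(elem.tag)
        if handler is not None:
            handler(elem, out)
    return out


def dispatch_per_tag(root):
    """One selective traversal per interesting tag name."""
    out = []
    for tag, handler in handlers.items():
        for elem in root.iter(tag):
            handler(elem, out)
    return out
```

Both functions call the same handlers on the same elements. Only the order of the calls differs, which is why the per-tag variant requires handlers that are independent of each other.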
The area where heavy lifting is required is the worksheet document, which contains cell elements as children of row elements. Max 2**20 rows and max 2**14 cells per row. If iterparse is available, a specialised controller method is used; it uses iterparse to iterate over row elements (only "end" events), clearing each row as it is finished. This should solve most of the memory problem for [c]ET. Extending this to clear the root element (i.e. avoid leaving empty row elements lying about) is a possibility.
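A minimal sketch of that row-by-row approach, assuming the stdlib ElementTree and a made-up worksheet layout. The element names and the `sum_cells` function are illustrative, not taken from the actual code.

```python
import io
import xml.etree.ElementTree as ET

# Made-up worksheet: five rows of two integer cells each.
rows = "".join(
    "<row><c>{}</c><c>{}</c></row>".format(i, i + 1) for i in range(5)
)
xml_data = ("<worksheet><sheetData>" + rows + "</sheetData></worksheet>").encode()


def sum_cells(source):
    """Sum all cell values, handling one finished row at a time."""
    total = 0
    # Only "end" events are requested, so each element is complete
    # by the time it is seen here.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            total += sum(int(c.text) for c in elem.iter("c"))
            # Drop the row's children and text. The now-empty row element
            # itself stays attached to its parent.
            elem.clear()
    return total


total = sum_cells(io.BytesIO(xml_data))
```

After the loop, the tree still holds one empty row element per row, which is the leftover that clearing the root element would address.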
Too bad that a) lxml's iterparse() is slightly slower than the one in cET and b) clearing the root element doesn't work in lxml.etree.
I guess a) is where your initial comment mainly originated from.
Regarding b), getting your hands on the root element in (c)ET would require you to also accept 'start' events, which usually results in such a huge performance drop for larger documents that it's almost always better to just leave the dead elements around instead. Also note that lxml.etree would allow you to drop them during the iteration by removing the preceding siblings of the current row element through their common parent element. So you can have your cake and eat it, too. :)
Stefan