[lxml-dev] lxml slower than ElementTree?

Background: John maintains xlrd, the python package for reading Excel files, and is looking to support Microsoft's newer xml based format...

John Machin wrote: <snip>
> Note that it needs an ElementTree implementation (supplied with more recent Pythons), and tries to find one in various places. Limited testing with lxml gave identical results, but slightly slower, so you could try that instead if you wanted to (would require fiddling with imports in xlsxrd.py)

That surprised me. Would anyone here be interested in taking a look at John's code to see what's tripping up lxml and causing it to be slower?

cheers,

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk
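
For readers who have not seen the "tries to find one in various places" pattern John mentions, here is a minimal sketch of such an import fallback. It is not taken from xlsxrd.py: the candidate list, its order, and the names `CANDIDATES` and `load_etree` are assumptions made for illustration.

```python
import importlib

# Hypothetical search order; xlsxrd.py may use a different list and order.
CANDIDATES = (
    "lxml.etree",               # third-party, only if installed
    "xml.etree.cElementTree",   # C accelerator, removed in Python 3.9
    "xml.etree.ElementTree",    # standard library
    "cElementTree",             # old standalone distributions
    "elementtree.ElementTree",
)


def load_etree(candidates=CANDIDATES):
    """Return (module, name) for the first ElementTree implementation found."""
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not available here, try the next candidate
        return module, name
    raise ImportError("no ElementTree implementation found")


ET, ET_NAME = load_etree()

# Code that sticks to the shared API runs unchanged on either implementation.
root = ET.fromstring("<sheet><row><c>1</c></row></sheet>")
print(ET_NAME, root.find("row/c").text)
```

Swapping implementations then comes down to reordering the candidate tuple, which is presumably the kind of import fiddling referred to above.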

Chris Withers, 24.11.2009 07:32:
> Background: John maintains xlrd, the python package for reading Excel files, and is looking to support Microsoft's newer xml based format...
>
> John Machin wrote: <snip>
>> Note that it needs an ElementTree implementation (supplied with more recent Pythons), and tries to find one in various places. Limited testing with lxml gave identical results, but slightly slower, so you could try that instead if you wanted to (would require fiddling with imports in xlsxrd.py)
>
> That surprised me. Would anyone here be interested in taking a look at John's code to see what's tripping up lxml and causing it to be slower?

I didn't look at the code, but you can take a look at

http://codespeak.net/lxml/performance.html

In general, ET can't compete with lxml.etree, whereas cET can, especially when you stay with code that supports both ET and lxml.etree.

Some major differences:

- lxml has a fast parser and a fast serialiser. cET has the first but not the latter. ET is straight out.

- lxml parses much faster from file names than from open file(-like) objects, especially multi-threaded. ET handles them exactly the same.

- lxml can run multi-threaded with great gains, ET benefits very little.

- lxml has XPath, ET doesn't.

- lxml has XSLT, ET doesn't.

- ET creates the tree as Python objects once and for all, lxml creates Python proxies only on request.

- ET 1.2 uses a simpler and faster ElementPath implementation than ET 1.3. lxml.etree uses the 1.3 implementation since version 2.x.

- Tree iteration using getiterator() in lxml.etree is much faster than in cET, and also much, much faster than using .find(). The difference is even more pronounced when searching for specific tags, because fewer proxies have to be created. (c)ET doesn't show a difference here.

So it is not surprising that code performs worse in lxml.etree if it was tuned for performance using ET - which it likely was, given the quote above. If it had been tuned for lxml.etree, I wouldn't be surprised if it ran faster in absolute numbers, but slower with ET.

It's also worth reading this:

http://codespeak.net/lxml/performance.html#a-longer-example

It might give an idea of how unexpected the performance of an implementation can be.

Stefan
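
To make two of the points above concrete (parsing from a file name, and tag-filtered iteration versus repeated find-style lookups), here is a rough sketch that runs on either implementation. The document layout (`sheet`/`row`/`c`) and the helper names are invented for illustration and are not John's code. It uses `iter()`, the current spelling of `getiterator()`, and `findall()` as the stand-in for find-based traversal. It prints timings but makes no claim about which variant wins on a given setup.

```python
import os
import tempfile
import timeit

try:
    from lxml import etree as ET  # used only if lxml is installed
except ImportError:
    import xml.etree.ElementTree as ET


def build_sheet(rows=200, cols=10):
    """Serialise a small made-up spreadsheet-like document to bytes."""
    root = ET.Element("sheet")
    for r in range(rows):
        row = ET.SubElement(root, "row")
        for c in range(cols):
            cell = ET.SubElement(row, "c")
            cell.text = str(r * cols + c)
    return ET.tostring(root)


def values_by_iter(root):
    # A single tag-filtered traversal of the whole tree.
    return [cell.text for cell in root.iter("c")]


def values_by_find(root):
    # Repeated path lookups, one per row.
    return [cell.text
            for row in root.findall("row")
            for cell in row.findall("c")]


# Parse from a file name rather than from an open file object.
fd, path = tempfile.mkstemp(suffix=".xml")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(build_sheet())
    root = ET.parse(path).getroot()
finally:
    os.remove(path)

# Both strategies must agree before their timings mean anything.
assert values_by_iter(root) == values_by_find(root)

for func in (values_by_iter, values_by_find):
    seconds = timeit.timeit(lambda: func(root), number=20)
    print(ET.__name__, func.__name__, round(seconds, 4))
```

Running the same script once with lxml installed and once without gives a quick feel for how differently the two libraries respond to the same access pattern.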