[lxml-dev] [PATCH] cElementTree compat for bazaar-ng
Hi list,

I have identified 3 incompatibility issues that prevented lxml.etree from being used in place of Fredrik's cElementTree (or ElementTree) module for the bazaar-ng SCM system.

I attach a unittest test suite (test_grisel.py) that highlights those issues. It is meant to be added to the unittests area of the testcase area on codespeak:

http://codespeak.net/svn/lxml/testcase/unittests/

However, it runs perfectly as a standalone script (you just need lxml and optionally cElementTree in your PYTHONPATH). Change the 'import' lines to feel the difference :)

And here is the good news: the second attached file is a patch against the current trunk (rev 13608) to make lxml compatible enough to make those tests pass and bzr's test suite as well :)

However, I used the basic self._c_doc replacement and I have no clue whether or not this will cause memory leaks (I know neither Pyrex nor libxml2).

And now the benchmarks:

    % rsync -av --delete bazaar-ng.org::bazaar-ng/bzr/bzr.dev /tmp/
    % cp -r /tmp/bzr.dev/ /tmp/bzr.dev-lxml/
    % perl -p -i -e 's/from cElementTree/from lxml.etree/g' /tmp/bzr.dev-lxml/bzrlib/*.py

    # bzr log with cElementTree
    % cd /tmp/bzr.dev/
    % time python ./bzr log > /dev/null
    real 0m0.524s
    user 0m0.481s
    sys  0m0.037s

    # bzr log with lxml.etree
    % cd /tmp/bzr.dev-lxml/
    % time python ./bzr log > /dev/null
    real 0m0.394s
    user 0m0.338s
    sys  0m0.049s

I repeated the measurement 10 times each, and the results were almost always the same. So lxml is a bit faster in this case.

Regards,

-- Olivier
The testcase is now uploaded in the test area: http://codespeak.net/svn/lxml/testcase/unittests/test_grisel.py regards, -- Olivier
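The idea of running one set of checks against interchangeable ElementTree implementations can be sketched roughly as follows. This is not the content of test_grisel.py; the sample document, the `CANDIDATES` list and the function names are assumptions made for illustration, and the standard-library `xml.etree.ElementTree` stands in for Fredrik's ElementTree where cElementTree is not installed.

```python
import importlib

# Candidate modules exposing the ElementTree API (illustrative list).
# Only the standard-library module is guaranteed to be importable;
# the others are tried and silently skipped when missing.
CANDIDATES = ["xml.etree.ElementTree", "lxml.etree", "cElementTree"]

# Hypothetical sample document, not taken from bzr.
SAMPLE = b"<inventory><entry name='a'>first</entry><entry name='b'/></inventory>"


def available_implementations():
    """Return a dict mapping module name to the imported module."""
    found = {}
    for name in CANDIDATES:
        try:
            found[name] = importlib.import_module(name)
        except ImportError:
            continue
    return found


def check_basic_api(etree):
    """Exercise a few API calls that every implementation should share."""
    root = etree.fromstring(SAMPLE)
    entries = root.findall("entry")
    return (root.tag, len(entries), root.findtext("entry"), entries[1].get("name"))


def compare_implementations():
    """Run the same checks on each available module and collect the results."""
    return {name: check_basic_api(mod)
            for name, mod in available_implementations().items()}
```

If every value returned by `compare_implementations()` is identical, the implementations agree on this (very small) slice of the API; any difference points at an incompatibility worth a dedicated test.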
Hi again,

I have found another quirk in lxml.etree.parse: it should be able to open file-like objects that don't have a name or filename attribute (or have an empty string), e.g.:

    urlgrabber.urlopen('http://python.org/channews.rdf')

This is used in bzr to branch from a remote (http) repository. Thus I updated the testcase [1] and the previous patch [attached] to handle this case.

Some more (meaningless) benchmarks:

    ogrisel@localhost:~/Developments $ time python bzr.dev-lxml/bzr branch bzr.dev bzr.test-branch-lxml
    Added 1210 texts.
    Added 720 inventories.
    Added 720 revisions.
    real 0m27.188s
    user 0m22.039s
    sys  0m2.325s

    ogrisel@localhost:~/Developments $ time python bzr.dev/bzr branch bzr.dev bzr.test-branch-cetree
    Added 1210 texts.
    Added 720 inventories.
    Added 720 revisions.
    real 0m19.436s
    user 0m16.907s
    sys  0m2.076s

In this case (local branching) the cElementTree version is significantly faster. I don't have the time to investigate why this benchmark gives such a different result compared to the 'bzr log' case.

[1] http://codespeak.net/svn/lxml/testcase/unittests/test_grisel.py

Regards,

-- Olivier
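A minimal sketch of the situation described above: a file-like object that offers only `read()` and carries no `name` or `filename` attribute, handed to `parse()`. The `NamelessStream` class and `root_tag_of` helper are hypothetical names invented for this illustration; they stand in for the object returned by urlgrabber. The import falls back to the standard-library ElementTree when lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # fall back to the standard-library implementation
    import xml.etree.ElementTree as etree


class NamelessStream:
    """File-like wrapper exposing only read(); it has no name or filename."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def read(self, size=-1):
        return self._buffer.read(size)


def root_tag_of(data):
    """Parse XML bytes through a nameless stream and return the root tag."""
    tree = etree.parse(NamelessStream(data))
    return tree.getroot().tag
```

A compatible `parse()` should treat the missing attribute as "no filename known" rather than as an error.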
Olivier Grisel wrote:
Hi again,
I have found another quirk in lxml.etree.parse: it should be able to open file-like objects that don't have a name or filename attribute (or have an empty string), e.g.: urlgrabber.urlopen('http://python.org/channews.rdf')
This is used in bzr to branch from a remote (http) repository. Thus I updated the testcase [1] and the previous patch [attached] to handle this case.
Some more (meaningless) benchmarks:
ogrisel@localhost:~/Developments $ time python bzr.dev-lxml/bzr branch bzr.dev bzr.test-branch-lxml
Added 1210 texts.
Added 720 inventories.
Added 720 revisions.
real 0m27.188s
user 0m22.039s
sys  0m2.325s
ogrisel@localhost:~/Developments $ time python bzr.dev/bzr branch bzr.dev bzr.test-branch-cetree
Added 1210 texts.
Added 720 inventories.
Added 720 revisions.
real 0m19.436s
user 0m16.907s
sys  0m2.076s
In this case (local branching) the cElementTree version is significantly faster. I don't have the time to investigate why this benchmark gives such a different result compared to the 'bzr log' case.
Thanks for the new patch. This stuff is really very cool! I think such real-world benchmarking is quite interesting. It'd be interesting if you could do a profiling run against bzr branch to see what in lxml (or cElementTree) ends up taking up time.

Regards,

Martijn
Martijn Faassen wrote:
Thanks for the new patch. This stuff is really very cool! I think such real-world benchmarking is quite interesting. It'd be interesting if you could do a profiling run against bzr branch to see what in lxml (or cElementTree) ends up taking up time.
Here is a sample profiling experiment (BZR + lxml local branch operation) with hotshot:

- the script: http://codespeak.net/svn/lxml/testcase/grisel/profiling/profile_bzr.py
- the results:

    ogrisel@groyours:~/Developments/lxml-testcase/grisel/profiling $ python profile_bzr.py
    Added 1179 texts.
    Added 695 inventories.
    Added 695 revisions.
             1030951 function calls (948981 primitive calls) in 9.718 CPU seconds

       Ordered by: internal time, call count
       List reduced from 470 to 10 due to restriction <'.*lxml.*'>

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          696    0.009    0.000    0.009    0.000 /usr/lib/python2.4/site-packages/lxml/_elementpath.py:98(find)
          696    0.009    0.000    0.020    0.000 /usr/lib/python2.4/site-packages/lxml/_elementpath.py:180(find)
          696    0.008    0.000    0.016    0.000 /usr/lib/python2.4/site-packages/lxml/_elementpath.py:186(findtext)
          696    0.006    0.000    0.006    0.000 /usr/lib/python2.4/site-packages/lxml/_elementpath.py:113(findtext)
         1392    0.005    0.000    0.005    0.000 /usr/lib/python2.4/site-packages/lxml/_elementpath.py:167(_compile)
            1    0.000    0.000    0.003    0.003 /usr/lib/python2.4/site-packages/lxml/_elementpath.py:49(?)
            2    0.000    0.000    0.000    0.000 /usr/lib/python2.4/site-packages/lxml/_elementpath.py:66(__init__)
            1    0.000    0.000    0.000    0.000 /usr/lib/python2.4/site-packages/lxml/__init__.py:0(?)
            1    0.000    0.000    0.000    0.000 /usr/lib/python2.4/site-packages/lxml/_elementpath.py:61(Path)
            1    0.000    0.000    0.000    0.000 /usr/lib/python2.4/site-packages/lxml/_elementpath.py:55(xpath_descendant_or_self)

Note that if I remove the restriction '.*lxml.*', only bzrlib and gzip functions appear in the top 20. Maybe I'm doing something wrong. Please feel free to change the script.

-- Olivier
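For readers who want to reproduce this kind of restricted report today, here is a rough sketch. It is not profile_bzr.py: hotshot no longer exists in Python 3, so `cProfile` is swapped in, the standard-library ElementTree replaces lxml so the block runs anywhere, and the `workload` function and its XML are made up. The sort keys and the regex restriction mirror the "Ordered by" and "List reduced ... due to restriction" lines of the report above.

```python
import cProfile
import io
import pstats
import xml.etree.ElementTree as etree

# Hypothetical stand-in for the XML that bzr would parse.
DOC = b"<revision><meta><message>hello</message></meta></revision>"


def workload(repeat=200):
    """Parse a small document repeatedly and look up one element by path."""
    for _ in range(repeat):
        root = etree.fromstring(DOC)
        # A path containing '/' is resolved by the ElementPath module.
        root.find(".//message")


def profile_workload(restriction, limit=10):
    """Profile workload() and return a report filtered by a regex restriction."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "time" is internal time and "calls" is call count, as in the report above.
    stats.sort_stats("time", "calls").print_stats(restriction, limit)
    return out.getvalue()
```

Calling `print(profile_workload(r".*ElementPath.*"))` lists only entries whose `filename:lineno(function)` string matches the pattern, which is how a restriction such as `'.*lxml.*'` narrows a long report to one package.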
Olivier Grisel wrote:
Martijn Faassen wrote:
Thanks for the new patch. This stuff is really very cool! I think such real-world benchmarking is quite interesting. It'd be interesting if you could do a profiling run against bzr branch to see what in lxml (or cElementTree) ends up taking up time.
Here is a sample profiling experiment (BZR + lxml local branch operation) with hotshot:
- the script: http://codespeak.net/svn/lxml/testcase/grisel/profiling/profile_bzr.py
- the results:
ogrisel@groyours:~/Developments/lxml-testcase/grisel/profiling $ python profile_bzr.py
Added 1179 texts.
Added 695 inventories.
Added 695 revisions.
1030951 function calls (948981 primitive calls) in 9.718 CPU seconds
[snip results]
Note that if I remove the restriction '.*lxml.*', only bzrlib and gzip functions appear in the top 20. Maybe I'm doing something wrong. Please feel free to change the script.
Hm, interesting. These .find() calls should almost all be replaceable by true XPath calls instead (if you figure out the right XPath expressions). That should speed things up a bit.

Then again, as you say, this seems to be only a minimal component of the actual call time. I'm a bit surprised I only find _elementpath.py stuff here (that's the .find() implementation that's identical to ElementTree's; it's already known that lxml's is somewhat slower than cElementTree's execution of this). That implementation itself will definitely use 'tag' and such in lxml, so I'm surprised that such calls in lxml.etree aren't showing up at all (though extension modules aren't profiled, so perhaps that's the cause).

There must be *some* reason lxml is slower than cElementTree on this, and the reason can't really be in bzrlib and gzip.

I don't think I'll be capable of doing much on this testing myself until EuroPython is over. So if you feel like doing some more profiling before then, I'm extremely interested in finding out more. :)

Regards,

Martijn
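As a rough illustration of swapping a `.find()`-style lookup for an XPath call, consider the sketch below. The `<revision>` document and the `committer` element are hypothetical and not taken from bzr's actual XML. `xpath()` is an lxml-only method, so the sketch falls back to `findtext()` when lxml is not installed; whether the XPath variant is actually faster would need to be measured.

```python
try:
    from lxml import etree
    HAVE_XPATH = True
except ImportError:  # the standard-library Element has no xpath() method
    import xml.etree.ElementTree as etree
    HAVE_XPATH = False

# Hypothetical document shape, for illustration only.
REVISION = b"<revision><committer>someone</committer><message>msg</message></revision>"


def committer_via_find(root):
    """ElementTree-style lookup, available in every implementation."""
    return root.findtext("committer")


def committer_via_xpath(root):
    """Same lookup expressed as an XPath string() call (lxml only)."""
    if not HAVE_XPATH:
        return committer_via_find(root)
    return root.xpath("string(committer)")
```

One behavioural difference to keep in mind: when the element is missing, `findtext()` returns `None` while the XPath `string()` function yields an empty string.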
Martijn Faassen wrote:
Hm, interesting. These .find() calls should almost all be replaceable by true XPath calls instead (if you figure out the right XPath expressions). That should speed things up a bit.
I would first need to learn xpath ;) Maybe one day, if I can get some spare time.
Then again, as you say, this seems to be only a minimal component of the actual call time. I'm a bit surprised I only find _elementpath.py stuff here (that's the .find() implementation that's identical with ElementTree's; it's already known lxml's is somewhat slower than cElementTree's execution of this). That implementation itself will definitely use 'tag' and such in lxml, so I'm surprised that such calls in lxml.etree aren't showing up at all (though extension modules aren't profiled so perhaps that's the cause).
Yes, but I thought hotshot was designed to take care of C extension modules. I need to google a bit more to understand how to make the C/Pyrex function calls appear in the stats, since we are comparing two libraries based on C extensions.
There must be *some* reason lxml is slower than cElementTree on this, and the reason can't really be in bzrlib and gzip.
No, sure. The bzrlib and gzip code remains the same in both cases (cElementTree and lxml).
I don't think I'll be capable of doing much on this testing myself until EuroPython is over. So if you feel like doing some more profiling before then, I'm extremely interested in finding out more. :)
Yes, I'll try to, but I don't have much spare time either.

-- Olivier
participants (2)
- Martijn Faassen
- Olivier Grisel