Hi list,

I updated my script to use the 'profile' module instead of 'hotshot' since 'profile' displays calls to C functions and builtins. Before actually profiling, I wanted to check that cElementTree was actually faster on the local branching operation by performing it three times (only on the first 100 revisions of BZR instead of the more than 700 at the time of writing):

ogrisel@localhost:~/Developments/lxml-testcase/grisel/profiling $ python profile_bzr.py -a timeit -x lxml
Timing against /tmp/bzr.dev-lxml/bzrlib/__init__.pyc
branching took 2.34565591812

ogrisel@localhost:~/Developments/lxml-testcase/grisel/profiling $ python profile_bzr.py -a timeit -x cetree
Timing against /tmp/bzr.dev/bzrlib/__init__.pyc
branching took 1.86508107185

So cElementTree is faster on the local branching operation. Now here is the profile of the same operation in both cases:

ogrisel@localhost:~/Developments/lxml-testcase/grisel/profiling $ python profile_bzr.py -a profile -x lxml
Profiling against /tmp/bzr.dev-lxml/bzrlib/__init__.pyc
Added 230 texts.
Added 100 inventories.
Added 100 revisions.

Thu Jun 23 23:42:22 2005    /tmp/bzr_data.profile

         385004 function calls (374095 primitive calls) in 5.700 CPU seconds

   Ordered by: internal time, call count
   List reduced from 429 to 20 due to restriction <20>

    ncalls   tottime  percall  cumtime  percall filename:lineno(function)
     64844     0.860    0.000    0.860    0.000 :0(get)
       430     0.520    0.001    0.520    0.001 :0(compress)
      8992     0.430    0.000    1.310    0.000 /tmp/bzr.dev-lxml/bzrlib/inventory.py:189(from_element)
     21611     0.390    0.000    0.600    0.000 /usr/lib/python2.4/posixpath.py:56(join)
17627/9095     0.230    0.000    0.470    0.000 /tmp/bzr.dev-lxml/bzrlib/inventory.py:317(iter_entries)
      9009     0.190    0.000    0.770    0.000 /tmp/bzr.dev-lxml/bzrlib/store.py:132(__contains__)
     10165     0.170    0.000    0.490    0.000 /tmp/bzr.dev-lxml/bzrlib/store.py:69(_path)
       204     0.160    0.001    0.160    0.001 :0(parse)
     18878     0.160    0.000    0.160    0.000 :0(access)
     22295     0.130    0.000    0.130    0.000 :0(startswith)
         1     0.130    0.130    4.730    4.730 /tmp/bzr.dev-lxml/bzrlib/branch.py:727(update_revisions)
 3601/1238     0.120    0.000    0.540    0.000 /tmp/bzr.dev-lxml/bzrlib/changeset.py:682(get_new_path)
      9977     0.110    0.000    0.230    0.000 /usr/lib/python2.4/posixpath.py:74(split)
       430     0.110    0.000    0.110    0.000 :0(flush)
       103     0.110    0.001    1.540    0.015 /tmp/bzr.dev-lxml/bzrlib/inventory.py:487(from_element)
      1218     0.110    0.000    0.110    0.000 :0(decompress)
      9090     0.100    0.000    0.120    0.000 /tmp/bzr.dev-lxml/bzrlib/inventory.py:408(add)
      3184     0.090    0.000    0.090    0.000 :0(seek)
      1836     0.080    0.000    0.480    0.000 /usr/lib/python2.4/gzip.py:242(_read)
     22160     0.080    0.000    0.080    0.000 :0(endswith)

ogrisel@localhost:~/Developments/lxml-testcase/grisel/profiling $ python profile_bzr.py -a profile -x cetree
Profiling against /tmp/bzr.dev/bzrlib/__init__.pyc
Added 230 texts.
Added 100 inventories.
Added 100 revisions.
Thu Jun 23 23:42:44 2005    /tmp/bzr_data.profile

         404062 function calls (393056 primitive calls) in 5.120 CPU seconds

   Ordered by: internal time, call count
   List reduced from 431 to 20 due to restriction <20>

    ncalls   tottime  percall  cumtime  percall filename:lineno(function)
       430     0.560    0.001    0.560    0.001 :0(compress)
     21611     0.460    0.000    0.650    0.000 /usr/lib/python2.4/posixpath.py:56(join)
      8992     0.350    0.000    0.670    0.000 /tmp/bzr.dev/bzrlib/inventory.py:189(from_element)
     64642     0.310    0.000    0.310    0.000 :0(get)
         1     0.250    0.250    3.990    3.990 /tmp/bzr.dev/bzrlib/branch.py:727(update_revisions)
      9977     0.220    0.000    0.310    0.000 /usr/lib/python2.4/posixpath.py:74(split)
     18878     0.200    0.000    0.200    0.000 :0(access)
17627/9095     0.170    0.000    0.480    0.000 /tmp/bzr.dev/bzrlib/inventory.py:317(iter_entries)
       204     0.160    0.001    0.310    0.002 :0(_parse)
      9009     0.140    0.000    0.650    0.000 /tmp/bzr.dev/bzrlib/store.py:132(__contains__)
 3601/1238     0.130    0.000    0.680    0.001 /tmp/bzr.dev/bzrlib/changeset.py:682(get_new_path)
     22295     0.120    0.000    0.120    0.000 :0(startswith)
     10165     0.110    0.000    0.380    0.000 /tmp/bzr.dev/bzrlib/store.py:69(_path)
      9090     0.100    0.000    0.120    0.000 /tmp/bzr.dev/bzrlib/inventory.py:408(add)
      2639     0.090    0.000    0.450    0.000 /usr/lib/python2.4/gzip.py:242(_read)
       430     0.080    0.000    0.080    0.000 :0(flush)
     15767     0.080    0.000    0.080    0.000 :0(rstrip)
       103     0.080    0.001    0.880    0.009 /tmp/bzr.dev/bzrlib/inventory.py:487(from_element)
      1617     0.080    0.000    0.080    0.000 :0(decompress)
     22160     0.070    0.000    0.070    0.000 :0(endswith)

The main difference is the :0(get) time, which is much higher in the lxml case.

C functions are tagged :0(function_name) and aggregated by name. This is annoying since, for instance, lxml/_elementpath.py:167(_compile) calls a C function named 'get' whose stats get aggregated with the stats of all the other 'get' C functions of the Python standard library. One has to use the print_callers method of the pstats.Stats object to list all the different contributions:

:0(get)
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:809(longest_to_shortest)(194)      0.000
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:829(rename_to_temp_delete)(98)     0.000
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:870(rename_to_new_create)(98)      0.100
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:1133(get_inventory_change)(98)     0.070
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:1366(make_basic_entry)(198)        0.160
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:1516(get_path)(100)                0.010
    /tmp/bzr.dev-lxml/bzrlib/inventory.py:189(from_element)(62944)           1.430
    /tmp/bzr.dev-lxml/bzrlib/revision.py:159(unpack_revision)(808)           0.040
    /usr/lib/python2.4/encodings/__init__.py:69(search_function)(3)          0.000
    /usr/lib/python2.4/site-packages/lxml/_elementpath.py:167(_compile)(202) 0.010
    /usr/lib/python2.4/sre.py:213(_compile)(100)                             0.000
    /usr/lib/python2.4/sre_parse.py:225(_class_escape)(1)                    0.000

This :0(get) time is the main difference between the two profiles. The parsing times :0(_parse) and :0(parse) are the same. So to summarize, this study tells us that to improve lxml over cElementTree on this particular case, one should first focus on optimizing the following piece of code (or the related code that uses it in lxml) in _elementpath.py:

"""
_cache = {}

##
# (Internal) Compile path.

def _compile(path):
    p = _cache.get(path)
    if p is not None:
        return p
    p = Path(path)
    if len(_cache) >= 100:
        _cache.clear()
    _cache[path] = p
    return p
"""

Best,

-- Olivier
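For readers who want to reproduce this kind of breakdown, here is a minimal sketch (with a placeholder function standing in for the actual profile_bzr.py machinery) of how a profile can be collected with the 'profile' module and the aggregated :0(get) entry split back out per caller with pstats:

import profile
import pstats

def branch_locally():
    # placeholder for the bzr local-branching operation being measured
    pass

# run the operation under the profiler and dump the raw data to a file
profile.run('branch_locally()', '/tmp/bzr_data.profile')

stats = pstats.Stats('/tmp/bzr_data.profile')
stats.sort_stats('time', 'calls')   # internal time, then call count
stats.print_stats(20)               # top 20 entries, as in the listings above

# C functions are aggregated by name, so list the :0(get) callers separately
stats.print_callers(':0\\(get\\)')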
Olivier Grisel wrote:
I updated my script to use the 'profile' module instead of 'hotshot' since 'profile' displays calls to C functions and builtins.
Cool, thanks again for doing all this work! [snip]
The main difference is the :0(get) time that is much higher in the lxml case.
C functions are tagged :0(function_name) and aggregated by name. This is annoying since, for instance, lxml/_elementpath.py:167(_compile) calls a C function named 'get' whose stats get aggregated with the stats of all the other 'get' C functions of the Python standard library. One has to use the print_callers method of the pstats.Stats object to list all the different contributions:
:0(get)
/tmp/bzr.dev-lxml/bzrlib/changeset.py:809(longest_to_shortest)(194) 0.000
/tmp/bzr.dev-lxml/bzrlib/changeset.py:829(rename_to_temp_delete)(98) 0.000
/tmp/bzr.dev-lxml/bzrlib/changeset.py:870(rename_to_new_create)(98) 0.100
/tmp/bzr.dev-lxml/bzrlib/changeset.py:1133(get_inventory_change)(98) 0.070
/tmp/bzr.dev-lxml/bzrlib/changeset.py:1366(make_basic_entry)(198) 0.160
/tmp/bzr.dev-lxml/bzrlib/changeset.py:1516(get_path)(100) 0.010
/tmp/bzr.dev-lxml/bzrlib/inventory.py:189(from_element)(62944) 1.430
/tmp/bzr.dev-lxml/bzrlib/revision.py:159(unpack_revision)(808) 0.040
/usr/lib/python2.4/encodings/__init__.py:69(search_function)(3) 0.000
/usr/lib/python2.4/site-packages/lxml/_elementpath.py:167(_compile)(202) 0.010
/usr/lib/python2.4/sre.py:213(_compile)(100) 0.000
/usr/lib/python2.4/sre_parse.py:225(_class_escape)(1) 0.000
This :0(get) time is the main difference between the two profiles. The parsing times :0(_parse) and :0(parse) are the same. So to summarize, this study tells us that to improve lxml over cElementTree on this particular case, one should first focus on optimizing the following piece of code (or the related code that uses it in lxml) in _elementpath.py:

"""
_cache = {}

##
# (Internal) Compile path.

def _compile(path):
    p = _cache.get(path)
    if p is not None:
        return p
    p = Path(path)
    if len(_cache) >= 100:
        _cache.clear()
    _cache[path] = p
    return p
"""
I'm a bit surprised by this -- this code should be the same as in cElementTree, as it uses the same implementation of find(): I took _elementpath.py from ElementTree, and cElementTree uses the same implementation. Additionally, this code does not appear to be calling into any ElementTree-specific portions at all. Why did you focus on this _compile() and not look at what from_element() is doing, for instance?

I'd like to know what kind of get call that is (a get() in lxml.etree as opposed to a get() on a Python dictionary). The only get() I can think of is one that gets an attribute from an element. If get() calls dominate, then optimizing that might have a good effect. It may, however, be that certain calls are hidden from the profiler. A good example of this is access to the 'tag' attribute -- I'd expect an _elementpath find() operation to call that a lot.

Note again that I expect most find() calls are relatively easy to change into an .xpath() call, which may be the most effective way to bring the performance of the lxml version and the cElementTree version closer together.

Regards,

Martijn
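To make the suggested find()-to-xpath() rewrite concrete, here is a minimal sketch (the element names are made up, not bzrlib's); note that xpath() returns a list of matches, while find() returns only the first one:

from lxml import etree

root = etree.XML('<inventory><entry name="a"/><entry name="b"/></inventory>')

# ElementPath-based lookup, which goes through the Python code in _elementpath.py
first = root.find('entry')

# roughly equivalent lookup through libxml2's C-level XPath engine
matches = root.xpath('entry')
if matches:
    first = matches[0]
else:
    first = None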
Hey,

Experimentally I've just checked in a possibly slightly faster implementation of .get() into lxml. You might want to run your profiling again. I don't think it'll help enormously much, but it should help some as I avoid some object creation.

I expect getting the speed of .get() near that of cElementTree may actually be quite a hard task. cElementTree has the benefit of being able to store element and attribute names in Clark notation ('{http://www.foo.nl}bar') directly, and can then use very fast Python dictionary access to get out the attribute value. lxml has to decode this first. Additionally, in cElementTree the attribute value is already in the right format (a Python unicode string or plain ASCII string), while lxml has to make a translation from UTF-8 to unicode or ASCII Python strings. On the string translation front there is some hope that future versions of libxml2 may allow at least some caching of pre-translated strings, but it'll be pretty hard overall to beat cElementTree on pure tree access performance. The only chance is by cheating and using the pure-C XPath. I have some hope that this might sometimes beat cElementTree, though only for relatively complicated and thus relatively rare XPath expressions.

Where lxml does compete is parsing speed (it's about equivalent to cElementTree's), and I imagine lxml is also pretty good at serialization speed. Where lxml really shines is in all the XML features it supports, with still very competitive performance. After all, comparing with cElementTree is comparing to possibly the fastest XML processing library in Python.

Regards,

Martijn
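To make the Clark notation point concrete, a minimal sketch (the namespace and attribute names are made up): both libraries expose a namespaced attribute under its '{uri}localname' key, and the cost difference described above lies entirely in how that key and its value are stored and decoded internally.

from lxml import etree

root = etree.XML(
    '<doc xmlns:foo="http://www.foo.nl"><item foo:bar="42"/></doc>')
item = root[0]

# the attribute is addressed by its Clark-notation key in both libraries;
# cElementTree can serve this from a plain dictionary lookup, while lxml
# has to translate names and values coming from libxml2 first
value = item.get('{http://www.foo.nl}bar')   # -> '42'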
Martijn Faassen wrote:
Hey,
Hey :)
Experimentally I've just checked in a possibly slightly faster implementation of .get() into lxml. You might want to run your profiling again. I don't think it'll help enormously much, but it should help some as I avoid some object creation.
I just checked it out and reran the timeit on the local branch: it now takes around 2.2s instead of 2.34s (and 1.86s for cetree).
I expect getting the speed of .get() near that of cElementTree may actually be quite a hard task. cElementTree has the benefit of being able to store element and attribute names in Clark notation ('{http://www.foo.nl}bar') directly, and can then use very fast Python dictionary access to get out the attribute value. lxml has to decode this first.
Here are the details of the callers of the :0(get):

In [1]: import pstats

In [2]: s = pstats.Stats('bzr_lxml.prof')

In [3]: s.print_callers(':0\(get\)')
   Random listing order was used
   List reduced from 429 to 1 due to restriction <':0\\(get\\)'>

Function was called by...
:0(get)
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:809(longest_to_shortest)(194)      0.000
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:829(rename_to_temp_delete)(98)     0.000
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:870(rename_to_new_create)(98)      0.070
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:1133(get_inventory_change)(98)     0.040
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:1366(make_basic_entry)(198)        0.150
    /tmp/bzr.dev-lxml/bzrlib/changeset.py:1516(get_path)(100)                0.000
    /tmp/bzr.dev-lxml/bzrlib/inventory.py:189(from_element)(62944)           1.290
    /tmp/bzr.dev-lxml/bzrlib/revision.py:159(unpack_revision)(808)           0.020
    /usr/lib/python2.4/encodings/__init__.py:69(search_function)(3)          0.000
    /usr/lib/python2.4/site-packages/lxml/_elementpath.py:167(_compile)(202) 0.000
    /usr/lib/python2.4/sre.py:213(_compile)(100)                             0.000
    /usr/lib/python2.4/sre_parse.py:225(_class_escape)(1)                    0.000
Out[3]: <pstats.Stats instance at 0xb7ccc7ac>

In [4]: s2 = pstats.Stats('bzr_cetree.prof')

In [5]: s2.print_callers(':0\(get\)')
   Random listing order was used
   List reduced from 431 to 1 due to restriction <':0\\(get\\)'>

Function was called by...
:0(get)
    /tmp/bzr.dev/bzrlib/changeset.py:809(longest_to_shortest)(194)  0.000
    /tmp/bzr.dev/bzrlib/changeset.py:829(rename_to_temp_delete)(98) 0.000
    /tmp/bzr.dev/bzrlib/changeset.py:870(rename_to_new_create)(98)  0.080
    /tmp/bzr.dev/bzrlib/changeset.py:1133(get_inventory_change)(98) 0.060
    /tmp/bzr.dev/bzrlib/changeset.py:1366(make_basic_entry)(198)    0.160
    /tmp/bzr.dev/bzrlib/changeset.py:1516(get_path)(100)            0.000
    /tmp/bzr.dev/bzrlib/inventory.py:189(from_element)(62944)       0.720
    /tmp/bzr.dev/bzrlib/revision.py:159(unpack_revision)(808)       0.030
    /usr/lib/python2.4/encodings/__init__.py:69(search_function)(3) 0.000
    /usr/lib/python2.4/sre.py:213(_compile)(100)                    0.000
    /usr/lib/python2.4/sre_parse.py:225(_class_escape)(1)           0.000

So the main difference actually comes from the from_element call (/tmp/bzr.dev-lxml/bzrlib/inventory.py:189), which is basically a bunch of 6 calls to elt.get (and not the dictionary get you were wondering about). So unless libxml2 implements some encoding cache, cElementTree will remain unbeatable on this kind of operation. But still, 2.2s against 1.84s is not that bad :)
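For illustration, a from_element-style deserializer boils down to something like the following (the attribute names are hypothetical, not bzrlib's actual code). With nearly 9000 from_element calls in the profile above and about half a dozen get() calls each, the per-call overhead of elt.get() is exactly what shows up as :0(get):

def entry_from_element(elt):
    # hypothetical sketch: every attribute lookup below is one elt.get() call,
    # so deserializing thousands of inventory entries issues tens of thousands
    # of get() calls in total
    file_id = elt.get('file_id')
    name = elt.get('name')
    parent_id = elt.get('parent_id')
    kind = elt.get('kind')
    text_sha1 = elt.get('text_sha1')
    text_size = elt.get('text_size')
    return (file_id, name, parent_id, kind, text_sha1, text_size)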
Where lxml does compete is parsing speed (it's about equivalent to cElementTree's), and I imagine lxml is also pretty good at serialization speed.
Yes, lxml is actually faster on pure parsing (the python.org RDF file, with timeit):

lxml_parse(fn())     0.00348305702209
lxml_parse(fs_f())   0.00358319282532
lxml_parse(cs_f())   0.00343585014343
cetree_parse(fn())   0.00477194786072
cetree_parse(fs_f()) 0.00471210479736
cetree_parse(cs_f()) 0.00483894348145

(fn is parsing from a filename, fs_f from a file object for a file on disk, and cs_f from a cStringIO file handle).
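For reference, a minimal sketch of how such a timeit comparison can be set up (the helper names fn, fs_f and cs_f mirror the description above; the RDF file name is illustrative):

import timeit

setup = '''
import cStringIO
import cElementTree
from lxml import etree

DATA_FILE = 'python-org.rdf'          # illustrative file name
data = open(DATA_FILE, 'rb').read()

def fn():                             # parse from a file name
    return DATA_FILE

def fs_f():                           # parse from a real file object
    return open(DATA_FILE, 'rb')

def cs_f():                           # parse from an in-memory file
    return cStringIO.StringIO(data)

lxml_parse = etree.parse
cetree_parse = cElementTree.parse
'''

for stmt in ('lxml_parse(fn())', 'lxml_parse(fs_f())', 'lxml_parse(cs_f())',
             'cetree_parse(fn())', 'cetree_parse(fs_f())', 'cetree_parse(cs_f())'):
    timer = timeit.Timer(stmt, setup)
    print stmt, min(timer.repeat(3, 10)) / 10   # seconds per parse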
Where lxml really shines is in all the XML features it supports, with still very competitive performance. After all, comparing with cElementTree is comparing to possibly the fastest XML processing library in Python.
True. So, what's happening at EuroPython on the XML side? Do you plan to publish your slides on lxml somewhere?

-- Olivier
Hi Olivier (and those listening in),

Back from EuroPython again!

Olivier Grisel wrote:
Martijn Faassen wrote:
Experimentally I've just checked in a possibly slightly faster implementation of .get() into lxml. You might want to run your profiling again. I don't think it'll help enormously much, but it should help some as I avoid some object creation.
I just checked it out and reran the timeit on the local branch: it now takes around 2.2s instead of 2.34s (and 1.86s for cetree).
Okay, no surprises there. A bit faster, which is good, though at the cost of a bit more redundant code. I can refactor that to be a bit less redundant later. [snip]
So the main difference actually comes from the from_element call, which is basically a bunch of 6 calls to elt.get (and not the dictionary get you were wondering about). So unless libxml2 implements some encoding cache, cElementTree will remain unbeatable on this kind of operation. But still, 2.2s against 1.84s is not that bad :)
Yup, not bad. Thanks for all the measurements! I do want to explore the encoding cache option, though it may require some changes in libxml2. I need to bring this up with Daniel Veillard again sometime soon.
Where lxml does compete is parsing speed (it's about equivalent to cElementTree's), and I imagine lxml is also pretty good at serialization speed.
Yes, lxml is actually faster on pure parsing (the python.org RDF file, with timeit):

lxml_parse(fn())     0.00348305702209
lxml_parse(fs_f())   0.00358319282532
lxml_parse(cs_f())   0.00343585014343
cetree_parse(fn())   0.00477194786072
cetree_parse(fs_f()) 0.00471210479736
cetree_parse(cs_f()) 0.00483894348145

(fn is parsing from a filename, fs_f from a file object for a file on disk, and cs_f from a cStringIO file handle).
Cool! [snip]
So, what's happening at EuroPython on the XML side? Do you plan to publish your slides on lxml somewhere?
I just uploaded my (fairly boring) magicpoint slides today. You can find them here (at the bottom): http://www.python-in-business.org/ep2005/talk.chtml?talk=2386&track=692 I had a very good chat with Tim Parkin of Pollenation at EuroPython, and I am looking forward to some contribution to lxml from that direction. Regards, Martijn