I do have to make copies of nodes in an application I'm writing, so I've
tried making a patch for lxml to allow me to do that. The patch is
attached, but since I don't have experience with Pyrex and libxml2 there
could well be something wrong with it. It seems to work, though.
Also I don't know if creating a new document / _ElementTree for the new
nodes is the right thing to do, but I suppose it should be...
Regards
Florian
Index: src/lxml/etree.pyx
===================================================================
--- src/lxml/etree.pyx (revision 12556)
+++ src/lxml/etree.pyx (working copy)
@@ -370,7 +370,21 @@
c_node = c_node.next
else:
raise ValueError, "Matching element could not be found"
+
+    def copy(self, recursive = True):
+        # 1 = copy recursive; 2 = don't copy recursive
+        recursive = int(not recursive) + 1
+
+        cdef xmlNode* c_node
+        cdef xmlDoc* c_doc
+        c_doc = theParser.newDoc()
+        etree = _elementTreeFactory(c_doc)
+        c_node = tree.xmlDocCopyNode(self._c_node, c_doc, recursive)
+        tree.xmlDocSetRootElement(c_doc, c_node)
+
+        return _elementFactory(etree, c_node)
+
# PROPERTIES
property tag:
def __get__(self):
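
For readers who want to see the semantics such a copy() method is aiming for, here is a minimal sketch. It does not use the patch above: it assumes the standard library's xml.etree.ElementTree as a stand-in and relies on copy.deepcopy for the recursive case. The non-recursive case is only approximated by building a new element with the same tag and attributes.

```python
import copy
import xml.etree.ElementTree as ET  # stand-in for lxml, an assumption of this sketch

root = ET.fromstring("<root><item id='1'><child/></item></root>")
item = root[0]

# Recursive copy: attributes and the whole subtree are duplicated.
deep = copy.deepcopy(item)

# Approximation of a non-recursive copy: same tag and attributes, no children.
shallow = ET.Element(item.tag, dict(item.attrib))

# Changing the copy must leave the original untouched.
deep.set("id", "2")
print(item.get("id"), deep.get("id"), len(deep), len(shallow))
```

The point of the check at the end is that the copy lives independently of the original tree, which is what creating a fresh document for the copied node is meant to guarantee.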
Hi list,
I updated my script to use the 'profile' module instead of 'hotshot'
since 'profile' displays calls to C functions and builtins.
Before actually profiling, I wanted to check that cElementTree really was
faster on the local branching operation by performing it three times
(only on the first 100 revisions of BZR, instead of the more than 700
that exist at the time of writing):
ogrisel@localhost:~/Developments/lxml-testcase/grisel/profiling $ python profile_bzr.py -a timeit -x lxml
Timing against /tmp/bzr.dev-lxml/bzrlib/__init__.pyc
branching took 2.34565591812
ogrisel@localhost:~/Developments/lxml-testcase/grisel/profiling $ python profile_bzr.py -a timeit -x cetree
Timing against /tmp/bzr.dev/bzrlib/__init__.pyc
branching took 1.86508107185
So cElementTree is faster on the local branching operation. Now here is
the profile of the same operation in both cases:
ogrisel@localhost:~/Developments/lxml-testcase/grisel/profiling $ python profile_bzr.py -a profile -x lxml
Profiling against /tmp/bzr.dev-lxml/bzrlib/__init__.pyc
Added 230 texts.
Added 100 inventories.
Added 100 revisions.
Thu Jun 23 23:42:22 2005 /tmp/bzr_data.profile
385004 function calls (374095 primitive calls) in 5.700 CPU seconds
Ordered by: internal time, call count
List reduced from 429 to 20 due to restriction <20>
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     64844    0.860    0.000    0.860    0.000 :0(get)
       430    0.520    0.001    0.520    0.001 :0(compress)
      8992    0.430    0.000    1.310    0.000 /tmp/bzr.dev-lxml/bzrlib/inventory.py:189(from_element)
     21611    0.390    0.000    0.600    0.000 /usr/lib/python2.4/posixpath.py:56(join)
17627/9095    0.230    0.000    0.470    0.000 /tmp/bzr.dev-lxml/bzrlib/inventory.py:317(iter_entries)
      9009    0.190    0.000    0.770    0.000 /tmp/bzr.dev-lxml/bzrlib/store.py:132(__contains__)
     10165    0.170    0.000    0.490    0.000 /tmp/bzr.dev-lxml/bzrlib/store.py:69(_path)
       204    0.160    0.001    0.160    0.001 :0(parse)
     18878    0.160    0.000    0.160    0.000 :0(access)
     22295    0.130    0.000    0.130    0.000 :0(startswith)
         1    0.130    0.130    4.730    4.730 /tmp/bzr.dev-lxml/bzrlib/branch.py:727(update_revisions)
 3601/1238    0.120    0.000    0.540    0.000 /tmp/bzr.dev-lxml/bzrlib/changeset.py:682(get_new_path)
      9977    0.110    0.000    0.230    0.000 /usr/lib/python2.4/posixpath.py:74(split)
       430    0.110    0.000    0.110    0.000 :0(flush)
       103    0.110    0.001    1.540    0.015 /tmp/bzr.dev-lxml/bzrlib/inventory.py:487(from_element)
      1218    0.110    0.000    0.110    0.000 :0(decompress)
      9090    0.100    0.000    0.120    0.000 /tmp/bzr.dev-lxml/bzrlib/inventory.py:408(add)
      3184    0.090    0.000    0.090    0.000 :0(seek)
      1836    0.080    0.000    0.480    0.000 /usr/lib/python2.4/gzip.py:242(_read)
     22160    0.080    0.000    0.080    0.000 :0(endswith)
ogrisel@localhost:~/Developments/lxml-testcase/grisel/profiling $ python profile_bzr.py -a profile -x cetree
Profiling against /tmp/bzr.dev/bzrlib/__init__.pyc
Added 230 texts.
Added 100 inventories.
Added 100 revisions.
Thu Jun 23 23:42:44 2005 /tmp/bzr_data.profile
404062 function calls (393056 primitive calls) in 5.120 CPU seconds
Ordered by: internal time, call count
List reduced from 431 to 20 due to restriction <20>
    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       430    0.560    0.001    0.560    0.001 :0(compress)
     21611    0.460    0.000    0.650    0.000 /usr/lib/python2.4/posixpath.py:56(join)
      8992    0.350    0.000    0.670    0.000 /tmp/bzr.dev/bzrlib/inventory.py:189(from_element)
     64642    0.310    0.000    0.310    0.000 :0(get)
         1    0.250    0.250    3.990    3.990 /tmp/bzr.dev/bzrlib/branch.py:727(update_revisions)
      9977    0.220    0.000    0.310    0.000 /usr/lib/python2.4/posixpath.py:74(split)
     18878    0.200    0.000    0.200    0.000 :0(access)
17627/9095    0.170    0.000    0.480    0.000 /tmp/bzr.dev/bzrlib/inventory.py:317(iter_entries)
       204    0.160    0.001    0.310    0.002 :0(_parse)
      9009    0.140    0.000    0.650    0.000 /tmp/bzr.dev/bzrlib/store.py:132(__contains__)
 3601/1238    0.130    0.000    0.680    0.001 /tmp/bzr.dev/bzrlib/changeset.py:682(get_new_path)
     22295    0.120    0.000    0.120    0.000 :0(startswith)
     10165    0.110    0.000    0.380    0.000 /tmp/bzr.dev/bzrlib/store.py:69(_path)
      9090    0.100    0.000    0.120    0.000 /tmp/bzr.dev/bzrlib/inventory.py:408(add)
      2639    0.090    0.000    0.450    0.000 /usr/lib/python2.4/gzip.py:242(_read)
       430    0.080    0.000    0.080    0.000 :0(flush)
     15767    0.080    0.000    0.080    0.000 :0(rstrip)
       103    0.080    0.001    0.880    0.009 /tmp/bzr.dev/bzrlib/inventory.py:487(from_element)
      1617    0.080    0.000    0.080    0.000 :0(decompress)
     22160    0.070    0.000    0.070    0.000 :0(endswith)
The main difference is the :0(get) time, which is much higher in the
lxml case (0.860 versus 0.310 seconds of internal time).
C functions are tagged :0(function_name) and aggregated by name. This
is annoying because, for instance, lxml/_elementpath.py:167(_compile)
calls a C function named 'get' whose stats get aggregated with the
stats of all the other C functions named 'get', including those of the
Python standard library. One has to use the print_callers method of the
pstats.Stats object to list all the different contributions:
:0(get)
/tmp/bzr.dev-lxml/bzrlib/changeset.py:809(longest_to_shortest)(194) 0.000
/tmp/bzr.dev-lxml/bzrlib/changeset.py:829(rename_to_temp_delete)(98) 0.000
/tmp/bzr.dev-lxml/bzrlib/changeset.py:870(rename_to_new_create)(98) 0.100
/tmp/bzr.dev-lxml/bzrlib/changeset.py:1133(get_inventory_change)(98) 0.070
/tmp/bzr.dev-lxml/bzrlib/changeset.py:1366(make_basic_entry)(198) 0.160
/tmp/bzr.dev-lxml/bzrlib/changeset.py:1516(get_path)(100) 0.010
/tmp/bzr.dev-lxml/bzrlib/inventory.py:189(from_element)(62944) 1.430
/tmp/bzr.dev-lxml/bzrlib/revision.py:159(unpack_revision)(808) 0.040
/usr/lib/python2.4/encodings/__init__.py:69(search_function)(3) 0.000
/usr/lib/python2.4/site-packages/lxml/_elementpath.py:167(_compile)(202) 0.010
/usr/lib/python2.4/sre.py:213(_compile)(100) 0.000
/usr/lib/python2.4/sre_parse.py:225(_class_escape)(1) 0.000
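
A caller breakdown like the one above can be obtained roughly as follows. This is a minimal sketch, not the actual profile_bzr.py script: the workload functions are hypothetical, and it uses cProfile rather than the 'profile' module mentioned earlier, since both feed the same pstats.Stats interface.

```python
import cProfile
import io
import pstats

def lookup_many(table, keys):
    # Hypothetical hot path: many calls to the C-level dict.get.
    results = []
    for key in keys:
        results.append(table.get(key))
    return results

def lookup_few(table):
    # Hypothetical cold path: a single call to the same C function.
    return table.get("a")

def workload():
    table = {"a": 1, "b": 2}
    lookup_many(table, ["a", "b", "c"] * 1000)
    lookup_few(table)

profiler = cProfile.Profile()
profiler.runcall(workload)

# Send the report to a buffer so it can be inspected as a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# The argument is a regular expression matched against function names;
# every caller of a matching function is listed with its own counts.
stats.print_callers("get")
report = buffer.getvalue()
print(report)
```

Both lookup_many and lookup_few show up as separate callers of the single aggregated 'get' entry, which is how the per-caller contributions can be told apart.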
This :0(get) time is the main difference between the two profiles. The
parsing times, :0(_parse) and :0(parse), are the same. So to
summarize, this study tells us that to improve lxml over cElementTree on
this particular case, one should first focus on optimizing the following
piece of code (or the related code that uses it in lxml) in _elementpath.py:
"""
_cache = {}

##
# (Internal) Compile path.

def _compile(path):
    p = _cache.get(path)
    if p is not None:
        return p
    p = Path(path)
    if len(_cache) >= 100:
        _cache.clear()
    _cache[path] = p
    return p
"""
Best,
--
Olivier