Performance of _Element.find()
data:image/s3,"s3://crabby-images/d4c59/d4c59ab2629f45fa029ab7aa5d1e5737f6631d46" alt=""
Hi, Suppose I have a large and shallow XML tree; in my case a <book> with several thousand <par> elements. I also have a large number (thousands) of paths like ... chapter[2]/par[19] ... chapter[2]/par[538]/em ... chapter[2]/par[1937] ... I started with a loop that iterates over these paths, calls find() and then manipulates the attributes of the found _Element instance. That could take several tens of seconds. It turns out that the culprit in my loop was the call to find(), accounting for 99% of the time. So I tried to see what would happen if I used an index map, and that's being generated in two passes: 1. Iterate over all paths, split them into their components and use those components as dict keys. Nest the dictionaries according to their path component. The value is a tuple (elem, dict()) where elem will be filled in by the second pass, and the dictionary is for nesting. For the above example: { 'chapter[2]': (None, { 'par[19]': (None, {}), 'par[538]': (None, { 'em': (None, {}) }), 'par[1937]: (None, {}), }), } 2. Iterate over all nodes of the XML tree (xpath('//*')) and get their path. Then fill the above dictionary with the elem references for those which are in that dictionary, i.e. replace the None with _Element instance references. Building that index map is negligible. Using this index map to find the elements in the XML tree is orders of magnitude faster than using find() -- iterating over all path expressions to manipulate attributes of elements went from several tens of seconds to a fraction of a second. I am astonished that find() is so slow? Why is that? Is walking down the tree based on the path string that expensive? Cheers, Jens -- Jens Tröger http://savage.light-speed.de/
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Jens Tröger schrieb am 19.04.2015 um 16:07:
Did you try lxml 3.4.3? It fixes a performance regression in the expression cache that the find*() methods use. That might not apply in your case, though, assuming that your path expressions are distinct. However, note that evaluating expressions like the above means traversing the tree and counting elements from the very top *each time*. That's O(#elements * #expressions) operations. It's not surprising that traversing the tree once in O(#elements) and then only looking up each path in O(1) time is faster, as that only adds up to O(#elements + #expressions) operations (with a little uncertainty regarding the tree depth in both cases, but your paths seem to be short enough to assume it's constant). Especially counting elements in path expressions (i.e. using numeric indices like you do) is not fast. Stefan
data:image/s3,"s3://crabby-images/d4c59/d4c59ab2629f45fa029ab7aa5d1e5737f6631d46" alt=""
On Sun, Apr 19, 2015 at 04:26:59PM +0200, Stefan Behnel wrote:
Thank you, Stefan. No, I haven't tried the latest version of lxml; my paths are all in parts or completely distinct. That would depend on the XML though and I'd say that in most cases 1-2 segments of a 5-6 segment path are the same. There is a maximum depth to my paths: 6 or 7 if I remember correctly. The counters in the path expressions come from lxml's getpath() function. Cheers, Jens -- Jens Tröger http://savage.light-speed.de/
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Jens Tröger schrieb am 26.04.2015 um 20:57:
Ah. Then you could also try passing them into XPath() instead of find(). Could be faster for this specific case. (But I'd still doubt it's as fast as a straight lookup.) Another thing you could try is to cache Element proxies in a list, e.g. cache = list(root.iter()) and *then* running the find() searches. That will avoid creating throw-away proxies just for the traversal. All of this being said, here is why element indexing is so slow in ElementPath: https://github.com/lxml/lxml/blob/1e7d8a8344d4c92d4b0571f34c459767c5546fdd/s... Inherited from ElementTree, BTW: https://hg.python.org/cpython/file/a6140242b79f/Lib/xml/etree/ElementPath.py... Improvements welcome. :) Stefan
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Jens Tröger schrieb am 19.04.2015 um 16:07:
Did you try lxml 3.4.3? It fixes a performance regression in the expression cache that the find*() methods use. That might not apply in your case, though, assuming that your path expressions are distinct. However, note that evaluating expressions like the above means traversing the tree and counting elements from the very top *each time*. That's O(#elements * #expressions) operations. It's not surprising that traversing the tree once in O(#elements) and then only looking up each path in O(1) time is faster, as that only adds up to O(#elements + #expressions) operations (with a little uncertainty regarding the tree depth in both cases, but your paths seem to be short enough to assume it's constant). Especially counting elements in path expressions (i.e. using numeric indices like you do) is not fast. Stefan
data:image/s3,"s3://crabby-images/d4c59/d4c59ab2629f45fa029ab7aa5d1e5737f6631d46" alt=""
On Sun, Apr 19, 2015 at 04:26:59PM +0200, Stefan Behnel wrote:
Thank you, Stefan. No, I haven't tried the latest version of lxml; my paths are all in parts or completely distinct. That would depend on the XML though and I'd say that in most cases 1-2 segments of a 5-6 segment path are the same. There is a maximum depth to my paths: 6 or 7 if I remember correctly. The counters in the path expressions come from lxml's getpath() function. Cheers, Jens -- Jens Tröger http://savage.light-speed.de/
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Jens Tröger schrieb am 26.04.2015 um 20:57:
Ah. Then you could also try passing them into XPath() instead of find(). Could be faster for this specific case. (But I'd still doubt it's as fast as a straight lookup.) Another thing you could try is to cache Element proxies in a list, e.g. cache = list(root.iter()) and *then* running the find() searches. That will avoid creating throw-away proxies just for the traversal. All of this being said, here is why element indexing is so slow in ElementPath: https://github.com/lxml/lxml/blob/1e7d8a8344d4c92d4b0571f34c459767c5546fdd/s... Inherited from ElementTree, BTW: https://hg.python.org/cpython/file/a6140242b79f/Lib/xml/etree/ElementPath.py... Improvements welcome. :) Stefan
participants (2)
-
Jens Tröger
-
Stefan Behnel