Performance of _Element.find() - lxml - The Python XML Toolkit

April 19, 2015

Hi,

Suppose I have a large and shallow XML tree; in my case a <book> with
several thousand <par> elements.  I also have a large number (thousands)
of paths like

  ...
  chapter[2]/par[19] 
  ...
  chapter[2]/par[538]/em 
  ...
  chapter[2]/par[1937]
  ...

I started with a loop that iterates over these paths, calls find() and
then manipulates the attributes of the found _Element instance.  That
could take several tens of seconds.  It turns out that the culprit in my
loop was the call to find(), accounting for 99% of the time.
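
For illustration, a minimal sketch of such a loop. The tiny document, the
paths and the "seen" attribute are made up for this example, and it uses
the standard library's xml.etree.ElementTree as a stand-in so that it
runs without lxml installed; lxml.etree accepts the same find() calls.

```python
import xml.etree.ElementTree as ET  # stand-in; lxml.etree offers the same find()

# Hypothetical miniature of the <book>: two chapters, a few <par> elements.
book = ET.fromstring(
    "<book>"
    "<chapter><par/></chapter>"
    "<chapter><par/><par><em/></par><par/></chapter>"
    "</book>"
)

# Paths of the same shape as in the post, with made-up indices.
paths = ["chapter[2]/par[1]", "chapter[2]/par[2]/em", "chapter[2]/par[3]"]

for path in paths:
    elem = book.find(path)        # every call resolves the path again from <book>
    if elem is not None:
        elem.set("seen", "yes")   # placeholder for the attribute manipulation
```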

So I tried an index map instead, which is generated in two passes:

1. Iterate over all paths, split them into their components and use
those components as dict keys.  Nest the dictionaries according to
their path component.  The value is a tuple (elem, dict()) where elem
will be filled in by the second pass, and the dictionary is for nesting.
For the above example:

  { 'chapter[2]': (None, { 'par[19]': (None, {}),
                           'par[538]': (None, { 'em': (None, {}) }),
                           'par[1937]': (None, {}),
                         }),
  }

2. Iterate over all nodes of the XML tree (xpath('//*')) and get their
path.  Then fill the above dictionary with the elem references for those
which are in that dictionary, i.e. replace the None with _Element
instance references.
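
The two passes could be sketched roughly as follows. This is an
assumption-laden illustration, not the code from my loop: the helper
names build_skeleton() and fill() are hypothetical, the entries are
lists rather than tuples so that the second pass can assign in place,
and the second pass derives each step by counting same-tag siblings
while walking down instead of asking every node for its path. It again
uses xml.etree.ElementTree as a stand-in for lxml.etree.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree

def build_skeleton(paths):
    """Pass 1: nest the path components as dict keys; elem slots stay None."""
    root = {}
    for path in paths:
        level = root
        for step in path.split("/"):
            # [elem, children]: a list, so pass 2 can fill in elem later.
            entry = level.setdefault(step, [None, {}])
            level = entry[1]
    return root

def fill(parent, level):
    """Pass 2: walk the tree and store the elements whose step is indexed."""
    if not level:
        return                      # nothing indexed below this element
    counts = {}
    for child in parent:
        n = counts[child.tag] = counts.get(child.tag, 0) + 1
        keys = ["%s[%d]" % (child.tag, n)]
        if n == 1:
            keys.append(child.tag)  # a bare step like 'em' means the first match
        for key in keys:
            entry = level.get(key)
            if entry is not None:
                entry[0] = child
                fill(child, entry[1])

# Hypothetical miniature document and paths.
book = ET.fromstring(
    "<book>"
    "<chapter><par/></chapter>"
    "<chapter><par/><par><em/></par><par/></chapter>"
    "</book>"
)
index = build_skeleton(
    ["chapter[2]/par[1]", "chapter[2]/par[2]/em", "chapter[2]/par[3]"]
)
fill(book, index)
```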

The time needed to build that index map is negligible.

Using this index map to find the elements in the XML tree is orders of
magnitude faster than using find() -- iterating over all path
expressions to manipulate attributes of elements went from several tens
of seconds to a fraction of a second.
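
A lookup in such an index only needs one dict access per path component.
A minimal sketch, assuming the nested (elem, dict) layout described
above; the function name lookup(), the hand-built index and the
"checked" attribute are hypothetical and only there for illustration:

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree

def lookup(index, path):
    """Follow the path components through the nested index; no tree walk."""
    level, elem = index, None
    for step in path.split("/"):
        elem, level = level[step]   # raises KeyError for a path never indexed
    return elem

# Hand-built index in the (elem, dict) shape, normally produced by the two passes.
chapter = ET.Element("chapter")
par = ET.SubElement(chapter, "par")
em = ET.SubElement(par, "em")
index = {"chapter[2]": (chapter, {"par[538]": (par, {"em": (em, {})})})}

lookup(index, "chapter[2]/par[538]/em").set("checked", "yes")
```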

I am astonished that find() is so slow.  Why is that?  Is walking down
the tree based on the path string really that expensive?

Cheers,
Jens

-- 
Jens Tröger
http://savage.light-speed.de/
