use of libxml2 xmlXPathOrderDocElems routine?

Hi everyone, I've read in a few places that libxml2's xmlXPathOrderDocElems can help speed up XPath queries when doing read-only lookups on documents. http://www.xmlsoft.org/html/libxml-xpath.html#xmlXPathOrderDocElems Has anyone tried it? Would it make sense to have a method like .xpath_speedup() or something on the _Element class? Perl LibXML has it: http://search.cpan.org/~shlomif/XML-LibXML-2.0107/lib/XML/LibXML/Document.po... Mentioned here: http://code.activestate.com/lists/perl-xml/8161/
This Zori project (I know nothing about) is using it: https://www.assembla.com/spaces/Zori/wiki https://www.assembla.com/code/Zori/subversion/nodes/372/trunk/src/ens/ens_re...
Regards, Paul.

Paul Tremberth, 25.11.2013 13:47:
Hadn't heard about it yet. I could imagine having a static method on the XPath class, something like "XPath.freeze_tree(element_or_tree)". Although with a big, fat warning that any later modification to the tree will break XPath. (Ok, lxml could also remember this operation in the _Document and run through the tree to clean it up before doing any modifications to it, but I don't mind requiring users to be aware of the tradeoff...) That being said, I would also guess that using the find*() methods instead of XPath would provide a similar speedup in many cases. Their path language is a lot simpler, but they don't need to do any node sorting by design and can avoid tag name string comparisons during deep subtree traversal (using el.iter()). So, in cases where your plain path expression is more selective than any "[...]" conditions (e.g. attribute value or text comparisons), the .find*() methods should win. Plus, they use iterators instead of one-shot collectors, so you can nicely short-circuit your search with them. Stefan

-----Original Message----- From: lxml [mailto:lxml-bounces@lxml.de] On Behalf Of Stefan Behnel Sent: 26 November 2013 07:13 To: lxml@lxml.de Subject: Re: [lxml] use of libxml2 xmlXPathOrderDocElems routine? Paul Tremberth, 25.11.2013 13:47:
Hadn't heard about it yet. I could imagine having a static method on the XPath class, something like "XPath.freeze_tree(element_or_tree)". Although with a big, fat warning that any later modification to the tree will break XPath. (Ok, lxml could also remember this operation in the _Document and run through the tree to clean it up before doing any modifications to it, but I don't mind requiring users to be aware of the tradeoff...) That being said, I would also guess that using the find*() methods instead of XPath would provide a similar speedup in many cases. Their path language is a lot simpler, but they don't need to do any node sorting by design and can avoid tag name string comparisons during deep subtree traversal (using el.iter()). So, in cases where your plain path expression is more selective than any "[...]" conditions (e.g. attribute value or text comparisons), the .find*() methods should win. Plus, they use iterators instead of one-shot collectors, so you can nicely short-circuit your search with them. Stefan _________________________________________________________________ If it's just a matter of calling a method on the root node and you get a speed increase (subject to the restrictions you mentioned) then I'm +1 on this (anything that helps with speed is good :) If the tradeoff is documented with the method and it's entirely optional then why not? Having said that, in a quick test I've found that using findtext() is definitely quicker than using xpath() - does anyone know would xmlXPathOrderDocElems give a comparable speedup (better/worse)? And would it speed up the find* methods too? Brian

Brian Bird, 26.11.2013 10:25:
Well, the tradeoff is that subsequent tree modifications break the assumptions of this optimisation without disabling the optimisation, so you may get incorrect results if the cleanup (or another run of xmlXPathOrderDocElems()) isn't done after your changes. I'm sure it provides a benefit for XPath on static trees, although that may also have decreased with recent libxml2 versions (which avoid some unnecessary sorting in the first place), starting with 2.9 IIRC. Anyway, my guess is that XPath is mostly executed on trees that don't change, so it should be worth supporting this optimisation somehow, maybe even as an internal optimisation. Documents could track whether they were modified in an incompatible way (which AFAICT are only insertions), and if so, then just run xmlXPathOrderDocElems() before the next call into XPath. That would assume that either XPath queries or modifications would dominate for a given tree during a certain time interval, but not both get mixed at the same time.
Having said that, in a quick test I've found that using findtext() is definitely quicker than using xpath() - does anyone know would xmlXPathOrderDocElems give a comparable speedup (better/worse)?
Depends. The optimisations in both implementations are very different, so their relative performance will vary a lot across different kinds of usages. See my quotes above.
And would it speed up the find* methods too?
No, they use a completely separate implementation, not XPath. Stefan

Paul Tremberth, 25.11.2013 13:47:
Hadn't heard about it yet. I could imagine having a static method on the XPath class, something like "XPath.freeze_tree(element_or_tree)". Although with a big, fat warning that any later modification to the tree will break XPath. (Ok, lxml could also remember this operation in the _Document and run through the tree to clean it up before doing any modifications to it, but I don't mind requiring users to be aware of the tradeoff...) That being said, I would also guess that using the find*() methods instead of XPath would provide a similar speedup in many cases. Their path language is a lot simpler, but they don't need to do any node sorting by design and can avoid tag name string comparisons during deep subtree traversal (using el.iter()). So, in cases where your plain path expression is more selective than any "[...]" conditions (e.g. attribute value or text comparisons), the .find*() methods should win. Plus, they use iterators instead of one-shot collectors, so you can nicely short-circuit your search with them. Stefan

-----Original Message----- From: lxml [mailto:lxml-bounces@lxml.de] On Behalf Of Stefan Behnel Sent: 26 November 2013 07:13 To: lxml@lxml.de Subject: Re: [lxml] use of libxml2 xmlXPathOrderDocElems routine? Paul Tremberth, 25.11.2013 13:47:
Hadn't heard about it yet. I could imagine having a static method on the XPath class, something like "XPath.freeze_tree(element_or_tree)". Although with a big, fat warning that any later modification to the tree will break XPath. (Ok, lxml could also remember this operation in the _Document and run through the tree to clean it up before doing any modifications to it, but I don't mind requiring users to be aware of the tradeoff...) That being said, I would also guess that using the find*() methods instead of XPath would provide a similar speedup in many cases. Their path language is a lot simpler, but they don't need to do any node sorting by design and can avoid tag name string comparisons during deep subtree traversal (using el.iter()). So, in cases where your plain path expression is more selective than any "[...]" conditions (e.g. attribute value or text comparisons), the .find*() methods should win. Plus, they use iterators instead of one-shot collectors, so you can nicely short-circuit your search with them. Stefan _________________________________________________________________ If it's just a matter of calling a method on the root node and you get a speed increase (subject to the restrictions you mentioned) then I'm +1 on this (anything that helps with speed is good :) If the tradeoff is documented with the method and it's entirely optional then why not? Having said that, in a quick test I've found that using findtext() is definitely quicker than using xpath() - does anyone know would xmlXPathOrderDocElems give a comparable speedup (better/worse)? And would it speed up the find* methods too? Brian

Brian Bird, 26.11.2013 10:25:
Well, the tradeoff is that subsequent tree modifications break the assumptions of this optimisation without disabling the optimisation, so you may get incorrect results if the cleanup (or another run of xmlXPathOrderDocElems()) isn't done after your changes. I'm sure it provides a benefit for XPath on static trees, although that may also have decreased with recent libxml2 versions (which avoid some unnecessary sorting in the first place), starting with 2.9 IIRC. Anyway, my guess is that XPath is mostly executed on trees that don't change, so it should be worth supporting this optimisation somehow, maybe even as an internal optimisation. Documents could track whether they were modified in an incompatible way (which AFAICT are only insertions), and if so, then just run xmlXPathOrderDocElems() before the next call into XPath. That would assume that either XPath queries or modifications would dominate for a given tree during a certain time interval, but not both get mixed at the same time.
Having said that, in a quick test I've found that using findtext() is definitely quicker than using xpath() - does anyone know would xmlXPathOrderDocElems give a comparable speedup (better/worse)?
Depends. The optimisations in both implementations are very different, so their relative performance will vary a lot across different kinds of usages. See my quotes above.
And would it speed up the find* methods too?
No, they use a completely separate implementation, not XPath. Stefan
participants (3)
-
Brian Bird
-
Paul Tremberth
-
Stefan Behnel