[lxml-dev] XPath optimization troubles.

Hello,

I expect this is properly a libxml2 question, but it's weird enough that I wanted to check here first to make sure that lxml isn't affecting the results.

I have two equivalent XPath expressions, one using prefixes to do the selection, and one using namespace-uri() to do the check. The namespace-uri() version consistently runs 2-3x faster on a range of test data, and I have no idea why.

Here's the prefix version:
'//@gizmo:*/parent::*[ not( self::gizmo:* ) ]'
And here's the namespace-uri version:
'//@*[ namespace-uri( ) = "%(gizmo)s" ]/parent::*[ namespace-uri( ) != "%(gizmo)s" ]' % namespaces

I'm running these as compiled expressions using etree.XPath( ..., namespaces = namespaces ), if that makes a difference.

Any hints? Or a faster XPath to do the same thing, for the ambitiously bored?

-- 
John Krukoff <jkrukoff@ltgc.com>
Land Title Guarantee Company
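
For readers who want to reproduce the comparison, here is a minimal sketch of the two expressions above, compiled with etree.XPath and run against a tiny document. The namespace URI, the sample document, and the variable names are assumptions made up for illustration; they are not from the original post, and timings on a document this small say little about real data.

```python
import timeit

from lxml import etree

# Hypothetical namespace URI and sample document, for illustration only.
namespaces = {"gizmo": "http://example.com/gizmo"}
doc = etree.fromstring(
    '<root xmlns:gizmo="http://example.com/gizmo">'
    '<a gizmo:x="1"/>'          # plain element with a gizmo attribute: should match
    '<gizmo:b gizmo:y="2"/>'    # gizmo element: excluded by the parent predicate
    '<c plain="3"/>'            # no gizmo attribute: never selected
    '</root>'
)

# Prefix version: the prefix is resolved through the namespaces mapping.
by_prefix = etree.XPath(
    '//@gizmo:*/parent::*[ not( self::gizmo:* ) ]',
    namespaces=namespaces,
)

# namespace-uri() version: the URI is interpolated into the expression text.
by_uri = etree.XPath(
    '//@*[ namespace-uri( ) = "%(gizmo)s" ]'
    '/parent::*[ namespace-uri( ) != "%(gizmo)s" ]' % namespaces
)

prefix_tags = [el.tag for el in by_prefix(doc)]
uri_tags = [el.tag for el in by_uri(doc)]

# Rough timing; the numbers depend heavily on the document and library versions.
prefix_time = timeit.timeit(lambda: by_prefix(doc), number=1000)
uri_time = timeit.timeit(lambda: by_uri(doc), number=1000)
print(prefix_tags, uri_tags, prefix_time, uri_time)
```

Both expressions select the same element here, so any timing difference is down to how the name test is evaluated, not to a difference in results.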

John Krukoff wrote:
> I expect this is properly a libxml2 question, but it's weird enough I wanted to check here first to make sure that lxml isn't affecting the results.
> I have equivalent XPath expressions, one using prefixes to do the selection, and one using namespace-uri to do the check. The namespace-uri version consistently runs 2-3x faster on a range of test data, and I have no idea why.
> Here's the prefix version:
> '//@gizmo:*/parent::*[ not( self::gizmo:* ) ]'
> And here's the namespace-uri version:
> '//@*[ namespace-uri( ) = "%(gizmo)s" ]/parent::*[ namespace-uri( ) != "%(gizmo)s" ]' % namespaces
> I'm running these as compiled expressions using etree.XPath( ..., namespaces = namespaces ), if that makes a difference. Any hints? Or a faster XPath to do the same thing for the ambitiously bored?

Just guessing: in libxml2, a node (element/attribute) knows its namespace URI, so comparing it to a constant string is a fast and local operation. Comparing namespace prefixes requires an indirection, as the prefix is mapped to a URI by the XPath evaluation context, and only the URI can be compared in a meaningful way.

I'm not sure if the XPath engine can optimise this, as it might be possible to change the namespace-prefix mapping during the run. So simply replacing the prefix by a URI check may not be correct. But this is something that might be worth bringing to the attention of the libxml2 mailing list.

BTW, have you also measured the performance of using an XPath variable for the URI in the second case?

Stefan
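
The XPath variable suggested above could look roughly like the following sketch. The variable name $ns, the namespace URI, and the sample document are hypothetical; the only lxml features relied on are compiling with etree.XPath and passing variable values as keyword arguments when calling the compiled expression.

```python
from lxml import etree

# Hypothetical namespace URI and sample document, for illustration only.
GIZMO = "http://example.com/gizmo"
doc = etree.fromstring(
    '<root xmlns:gizmo="http://example.com/gizmo">'
    '<a gizmo:x="1"/>'
    '<gizmo:b gizmo:y="2"/>'
    '</root>'
)

# The URI is no longer part of the expression text; it is supplied as the
# XPath variable $ns each time the compiled expression is called.
by_variable = etree.XPath(
    '//@*[ namespace-uri( ) = $ns ]/parent::*[ namespace-uri( ) != $ns ]'
)

matches = by_variable(doc, ns=GIZMO)
variable_tags = [el.tag for el in matches]
print(variable_tags)
```

A side effect of this form is that the expression is compiled once and can be reused with different URIs.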

On Fri, 2009-10-23 at 11:49 +0200, Stefan Behnel wrote:
> John Krukoff wrote:
> > I expect this is properly a libxml2 question, but it's weird enough I wanted to check here first to make sure that lxml isn't affecting the results.
> > I have equivalent XPath expressions, one using prefixes to do the selection, and one using namespace-uri to do the check. The namespace-uri version consistently runs 2-3x faster on a range of test data, and I have no idea why.
> > Here's the prefix version:
> > '//@gizmo:*/parent::*[ not( self::gizmo:* ) ]'
> > And here's the namespace-uri version:
> > '//@*[ namespace-uri( ) = "%(gizmo)s" ]/parent::*[ namespace-uri( ) != "%(gizmo)s" ]' % namespaces
> <snipped>
> BTW, have you also measured the performance of using an XPath variable for
> the URI in the second case?
>
> Stefan

Finally got around to giving this a try, and the performance difference for using XPath variables looks to be negligible. Switched over to using variables then, as the substitution is obviously safer in the general case, and it looks a bit cleaner.

Wanted to mention on the list, in case anybody else ends up using this optimization. As always, thanks for the tip Stefan.

-- 
John Krukoff <jkrukoff@ltgc.com>
Land Title Guarantee Company
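
One way to see why variable substitution is safer in the general case is to try a value that contains a double quote. The sketch below assumes a made-up string (not a real namespace from this thread) and shows that interpolating it into the expression text breaks compilation, while passing the same string as an XPath variable evaluates normally.

```python
from lxml import etree

# Hypothetical values, for illustration only.
GIZMO = "http://example.com/gizmo"
tricky = 'http://example.com/"gizmo"'  # contains double quotes

doc = etree.fromstring(
    '<root xmlns:gizmo="http://example.com/gizmo"><a gizmo:x="1"/></root>'
)

# String interpolation: the embedded quotes end the XPath string literal
# early, so the resulting expression text no longer compiles.
try:
    etree.XPath('//@*[ namespace-uri( ) = "%s" ]' % tricky)
    interpolation_ok = True
except etree.XPathError:
    interpolation_ok = False

# XPath variable: the value never becomes part of the expression text,
# so no quoting or escaping is needed.
by_variable = etree.XPath('//@*[ namespace-uri( ) = $ns ]')
tricky_result = by_variable(doc, ns=tricky)   # nothing is in that namespace
gizmo_result = by_variable(doc, ns=GIZMO)     # the gizmo:x attribute value

print(interpolation_ok, tricky_result, gizmo_result)
```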