[lxml-dev] confusing xpath performance characteristics
Hi, I ran into some performance characteristics of lxml/libxml2 xpath that I find rather confusing: I try to find the @type attribute of a certain element in an XML Schema (which contains lots of complexType definitions with lots of elements in them; unfortunately I can't post the schema):
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.095885038375854492, 0.096823930740356445, 0.096174955368041992]
So I think I'm being smart and give a little more path information - reckoning that this should *improve* performance:
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.1770780086517334, 0.1775970458984375, 0.17748594284057617]
Hm. Performance degrades slightly. I'm adding even more of the path to where my desired elements live in the schema:
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [103.79744100570679, 103.83671712875366, 103.61817717552185]
What???
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('/ae/data/pydev/hjoukl/NDM/SVN_CO/TRUNK/ndm/reference/xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.044407129287719727, 0.044126987457275391, 0.044229030609130859]
Ok, this version's better than my naive approach, which seems logical to me. But why would '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' perform drastically slower than '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' ? libxml2 problem? Running the same xpaths in Oxygen I don't notice performance differences (can't profile this). Holger -- GRATIS für alle GMX-Mitglieder: Die maxdome Movie-FLAT! Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01
Oops,
But why would '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' perform drastically slower than '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' ?
libxml2 problem? Running the same xpaths in Oxygen I don't notice performance differences (can't profile this).
That was supposed to read: Why would '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' perform drastically slower than '//xs:element[@name="equity"]/@type' ? Holger -- DSL-Preisknaller: DSL Komplettpakete von GMX schon für 16,99 Euro mtl.!* Hier klicken: http://portal.gmx.net/de/go/dsl02
That was supposed to read:
Why would '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' perform drastically slower than '//xs:element[@name="equity"]/@type' ?
And one more thing I forgot to mention: lxml version: 2.2.2 libxml2 version: (2, 6, 32) libxslt version: (1, 1, 23) Holger -- DSL-Preisknaller: DSL Komplettpakete von GMX schon für 16,99 Euro mtl.!* Hier klicken: http://portal.gmx.net/de/go/dsl02
Hi, jholg@gmx.de, 05.11.2009 14:08:
I ran into some performance characteristics of lxml/libxml2 xpath that I find rather confusing:
I try to find the @type attribute of a certain element in an XML Schema (which contains lots of complexType definitions with lots of elements in them; unfortunately I can't post the schema):
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.095885038375854492, 0.096823930740356445, 0.096174955368041992]
So I think I'm being smart and give a little more path information - reckoning that this should *improve* performance:
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.1770780086517334, 0.1775970458984375, 0.17748594284057617]
Hm. Performance degrades slightly. I'm adding even more of the path to where my desired elements live in the schema:
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [103.79744100570679, 103.83671712875366, 103.61817717552185]
What???
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('/ae/data/pydev/hjoukl/NDM/SVN_CO/TRUNK/ndm/reference/xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.044407129287719727, 0.044126987457275391, 0.044229030609130859]
Ok, this version's better than my naive approach, which seems logical to me.
I have no idea what libxml2 does here, but it looks like there are some optimisations going on in the most simple case. Also see the other thread two weeks ago, where John Krukoff stumbled over unexpected performance differences when testing for namespaces. Consider running the evaluation through callgrind and kcachegrind to find out what happens in each case and what works differently. Stefan
[cross-posting this from lxml-dev] jholg@gmx.de, 05.11.2009 22:05:
Why would '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' [takes 103.62 seconds in libxml2 2.6.32] perform drastically slower than '//xs:element[@name="equity"]/@type' [takes 0.096 seconds] ?
Yes, that's surprising, especially when you see the absolute times. I ran it through callgrind, and the problem is that the first case produces separate node set results, one for each parent element that was found and searched. I.e., the evaluation works more or less like this: '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' -> walk through all "xs:schema" root elements (1 result) -> walk through all "xs:complexType" children (many found) -> for each result, search matching "xs:element" descendants -> for each match, select the "type" attribute (1 result) -> collect all matched attributes in a node-set -> merge the set of results and sort them into document order It's the last operation, merging and sorting large sets of results, that makes this extremely slow - it takes 92% of the evaluation time in my tests (using libxml2 2.7.5). It's much faster to traverse the document in a single step, and just select single attributes from it, that can quickly be appended to the node set. I imagine that this step could actually be optimised away in many cases (like the case above, where results are guaranteed to be found in doc order), so I guess it's just in there to avoid too much special casing. But it seriously kills the performance here. The sorting algorithm is an unstable shell-sort, whose exponential runtime I expect to be the key problem here. I would assume that in most cases, the partial node-sets are already in doc-order, so optimising away the sorting in favour of appending one node-set to the other whenever doc order is guaranteed to be preserved would drastically drop the runtime of the longer expression. Stefan
Hi, warming up this really old thread. jholg@gmx.de, 05.11.2009 14:08:
I ran into some performance characteristics of lxml/libxml2 xpath that I find rather confusing:
I try to find the @type attribute of a certain element in an XML Schema (which contains lots of complexType definitions with lots of elements in them; unfortunately I can't post the schema):
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.095885038375854492, 0.096823930740356445, 0.096174955368041992]
So I think I'm being smart and give a little more path information - reckoning that this should *improve* performance:
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.1770780086517334, 0.1775970458984375, 0.17748594284057617]
Hm. Performance degrades slightly. I'm adding even more of the path to where my desired elements live in the schema:
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [103.79744100570679, 103.83671712875366, 103.61817717552185]
What???
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('/ae/data/pydev/hjoukl/NDM/SVN_CO/TRUNK/ndm/reference/xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.044407129287719727, 0.044126987457275391, 0.044229030609130859]
Ok, this version's better than my naive approach, which seems logical to me. But why would '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' perform drastically slower than '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' ?
libxml2 problem? Running the same xpaths in Oxygen I don't notice performance differences (can't profile this).
I think this will finally be fixed in libxml2 2.9. Daniel Veillard just merged in patches that optimise both the "//" XPath axis and the node set sorting, which now uses the all famous timsort algorithm. Expect major speed-ups in the XPath handling code with the next libxml2 release. Stefan
participants (2)
-
jholg@gmx.de
-
Stefan Behnel