Hi, warming up this really old thread.

jholg@gmx.de, 05.11.2009 14:08:
I ran into some performance characteristics of lxml/libxml2 xpath that I find rather confusing:
I'm trying to find the @type attribute of a certain element in an XML Schema (which contains lots of complexType definitions with lots of elements in them; unfortunately I can't post the schema):
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[0.095885038375854492, 0.096823930740356445, 0.096174955368041992]
So I think I'm being smart and give a little more path information - reckoning that this should *improve* performance:
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[0.1770780086517334, 0.1775970458984375, 0.17748594284057617]
Hm. Performance degrades: the run time nearly doubles. I'm adding even more of the path to where my desired elements live in the schema:
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[103.79744100570679, 103.83671712875366, 103.61817717552185]
What???
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('/ae/data/pydev/hjoukl/NDM/SVN_CO/TRUNK/ndm/reference/xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[0.044407129287719727, 0.044126987457275391, 0.044229030609130859]
Ok, this version's better than my naive approach, which seems logical to me. But why would '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' perform drastically slower than '/xs:schema//xs:element[@name="equity"]/@type'?
libxml2 problem? Running the same xpaths in Oxygen I don't notice performance differences (can't profile this).
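
Since the real schema can't be posted, here is a rough sketch for trying the same four expressions on a synthetic schema. Everything about the generated document is an assumption made for illustration: the helper names build_schema and time_expressions, the complexType/sequence/element layout, the sizes, and the 'EquityType' value have nothing to do with NDM.xsd. Timings will depend on the libxml2 version and the document size, so no numbers are claimed here.

```python
import timeit

from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"
NS = {"xs": XS}

# The four expressions from the post, in the order they were tried.
EXPRESSIONS = [
    '//xs:element[@name="equity"]/@type',
    '/xs:schema//xs:element[@name="equity"]/@type',
    '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type',
    '/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type',
]


def build_schema(n_types=200, n_elements=50):
    """Build a made-up schema (NOT the real NDM.xsd): n_types complexTypes,
    each holding one xs:sequence with n_elements xs:element children."""
    root = etree.Element("{%s}schema" % XS, nsmap={"xs": XS})
    seq = None
    for t in range(n_types):
        ctype = etree.SubElement(root, "{%s}complexType" % XS, name="Type%d" % t)
        seq = etree.SubElement(ctype, "{%s}sequence" % XS)
        for e in range(n_elements):
            etree.SubElement(seq, "{%s}element" % XS,
                             name="elem%d_%d" % (t, e), type="xs:string")
    # Put the single element we search for into the last complexType.
    etree.SubElement(seq, "{%s}element" % XS, name="equity", type="EquityType")
    return root


def time_expressions(schema, number=10):
    """Return {expression: (xpath result, best time in seconds for `number` calls)}."""
    results = {}
    for expr in EXPRESSIONS:
        xpath = etree.XPath(expr, namespaces=NS)
        timings = timeit.repeat(lambda: xpath(schema), number=number, repeat=3)
        results[expr] = (xpath(schema), min(timings))
    return results


if __name__ == "__main__":
    schema = build_schema()
    for expr, (hits, seconds) in time_expressions(schema).items():
        print("%-70s %s %.4fs" % (expr, hits, seconds))
```

All four expressions should select the same attribute; only the time taken is of interest. Increasing n_types and n_elements may be needed before any difference becomes visible.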

I think this will finally be fixed in libxml2 2.9. Daniel Veillard just merged in patches that optimise both the "//" XPath axis and the node set sorting, which now uses the famous timsort algorithm. Expect major speed-ups in the XPath handling code with the next libxml2 release.

Stefan
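
To see which libxml2 a given lxml installation is actually running against, the version tuples that lxml exposes can be compared with 2.9. This is only a sketch: has_xpath_speedups is a made-up helper name, and treating "2.9.0 or later" as "contains the patches" is an assumption based on the expectation stated above, not something verified against the libxml2 release notes.

```python
from lxml import etree


def has_xpath_speedups(version=None):
    """Guess whether libxml2 is new enough to include the "//" and
    node set sorting patches. Assumption: they ship with 2.9.0."""
    if version is None:
        # Version of the libxml2 library loaded at runtime, e.g. (2, 9, 10).
        version = etree.LIBXML_VERSION
    return tuple(version) >= (2, 9, 0)


if __name__ == "__main__":
    print("libxml2 in use:           ", etree.LIBXML_VERSION)
    print("libxml2 lxml was built on:", etree.LIBXML_COMPILED_VERSION)
    print("2.9 or later:             ", has_xpath_speedups())
```

The runtime version is the one that matters for XPath speed; the compiled version only tells you what lxml was built against.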