Hi, jholg@gmx.de, 05.11.2009 14:08:
I ran into some performance characteristics of lxml/libxml2 xpath that I find rather confusing:
I try to find the @type attribute of a certain element in an XML Schema (which contains lots of complexType definitions with lots of elements in them; unfortunately I can't post the schema):
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.095885038375854492, 0.096823930740356445, 0.096174955368041992]
So I think I'm being smart and give a little more path information - reckoning that this should *improve* performance:
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.1770780086517334, 0.1775970458984375, 0.17748594284057617]
Hm. Performance degrades slightly. I'm adding even more of the path to where my desired elements live in the schema:
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [103.79744100570679, 103.83671712875366, 103.61817717552185]
What???
timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('/ae/data/pydev/hjoukl/NDM/SVN_CO/TRUNK/ndm/reference/xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10) [0.044407129287719727, 0.044126987457275391, 0.044229030609130859]
Ok, this version's better than my naive approach, which seems logical to me.
I have no idea what libxml2 does here, but it looks like there are some optimisations going on in the most simple case. Also see the other thread two weeks ago, where John Krukoff stumbled over unexpected performance differences when testing for namespaces. Consider running the evaluation through callgrind and kcachegrind to find out what happens in each case and what works differently. Stefan