[lxml-dev] Callgrind tests

Hello everyone, another one for the archives.

I did a few tests with Callgrind and KCachegrind (if you don't know KCachegrind, install it, you'll love it), as I suspected the XPath wrapper had become slow due to the global function registries. What I found was:

1) libxml2 performance is heavily bound by malloc calls (not sure if Callgrind influences this). The XPath implementation is so incredibly fast that the registration of the /builtin/ XPath functions (xmlXPathRegisterAllFunctions) and the related hash table creation (two xmlHashCreate calls per XPath context) were the major bottlenecks in my tests. The overhead added by lxml itself was negligible.

2) String formatting in Python was the other problem. The major bottleneck in tree setup in bench.py was the Python code that builds the element names from loop variables (PyString_Format). In other words, the bottleneck was /outside/ the tested code this time.

So the major result is that, for the tested parts, lxml's performance is mainly bound by two factors: Python and libxml2. I guess I can safely assume that the code parts I checked are too small an issue to merit any further optimization efforts.

Have fun,
Stefan
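[Editor's note: finding 2) can be illustrated with a minimal Python sketch. The function names, the cache, and the name pattern below are made up for illustration; bench.py's actual code differs.]

```python
import timeit

def names_formatted(n):
    # Builds each element name with a formatting call per iteration,
    # roughly one PyString_Format call each in the old CPython runtime.
    return ["element-%d-%d" % (i, i % 10) for i in range(n)]

_cache = {}

def names_precomputed(n):
    # Builds the name list once and reuses it on later calls, moving the
    # formatting cost out of the measured loop.
    if n not in _cache:
        _cache[n] = names_formatted(n)
    return _cache[n]

# Timing both shows that repeated formatting, not the tree code under
# test, can dominate such a benchmark loop.
t_fmt = timeit.timeit(lambda: names_formatted(1000), number=100)
t_pre = timeit.timeit(lambda: names_precomputed(1000), number=100)
```

Precomputing the names outside the timed loop is one way to make a benchmark like bench.py measure the tree code rather than Python's string formatting.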

Stefan Behnel wrote: [snip]
I'm not sure what you mean about string formatting... The bottleneck being outside the tested code means that lxml makes calls to the Python API and that this is costing time, right? So, one possibility to speed up lxml is for it to call the Python API less often.

One possibility towards optimization would be getting rid of the need to decode UTF-8 (libxml2) to Python unicode all the time (or to plain Python strings if they're ASCII). This could be done by caching the Python unicode/strings somehow. I discussed this a long time ago with Daniel Veillard and he mentioned extending the string dictionary in libxml2 so it could carry another payload field, which could be our string. If that payload is already there, we can simply return it instead of regenerating it.

Regards,
Martijn
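[Editor's note: a rough Python-level sketch of the caching idea. The real payload would live inside libxml2's C string dictionary; the cache dict and function name here are assumptions for illustration only.]

```python
# Map the raw UTF-8 bytes coming from libxml2 to an already-decoded
# Python string, so each distinct name is decoded only once.
_string_cache = {}

def cached_decode(utf8_bytes):
    try:
        return _string_cache[utf8_bytes]
    except KeyError:
        try:
            # Pure-ASCII input: the "plain string" case mentioned above.
            value = utf8_bytes.decode('ascii')
        except UnicodeDecodeError:
            # Otherwise decode to a real unicode string.
            value = utf8_bytes.decode('utf-8')
        _string_cache[utf8_bytes] = value
        return value
```

On a cache hit, the same string object is handed back without any decoding, which is exactly the effect the proposed libxml2 payload field would give.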

Martijn Faassen schrieb:
No, actually I meant that it is the /calling/ code that becomes the bottleneck, not the code I tested. The API and everything behind it was plenty fast for the test. I know that's not the best evaluation (it doesn't really test the code itself), but it shows that the code I tested is so fast that programs using it will most likely have their performance problems elsewhere.
I didn't test unicode conversion at all. char*->String conversion did not seem to be a problem; I guess that's mainly a strlen, a memcpy and a C-object instantiation, all of which should be pretty fast.
I already thought about something like that when I went through my optimizations. When you look at some of the benchmark results, you will see that cElementTree is about 3 times faster for the element.text and element.tag benchmarks. As FL pointed out, that's due to exactly this optimization: build Python strings only once.

The main problem, however, is that lxml uses properties here. In my benchmarks, these turned out to produce more overhead than the actual conversion afterwards (I tested that by returning a constant string). Apart from that, there is no other spot left for performance tweaking in that part of lxml - I checked. :)

Stefan
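[Editor's note: the property-overhead observation can be reproduced with a small sketch. The class names are invented; the constant-string property body corresponds to the test described above.]

```python
import timeit

class PlainAttribute:
    def __init__(self):
        self.tag = 'root'

class ViaProperty:
    @property
    def tag(self):
        # Returning a constant string isolates the property-call
        # overhead from any actual char* -> string conversion cost.
        return 'root'

a, p = PlainAttribute(), ViaProperty()
t_attr = timeit.timeit(lambda: a.tag, number=200000)
t_prop = timeit.timeit(lambda: p.tag, number=200000)
```

Comparing t_attr and t_prop shows the descriptor-call cost that a property adds on every access, independent of what its getter actually does.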

Hi Martijn,

Martijn Faassen wrote:
I figured there is one place where implementing caching is cheap: element.tag. So I decided to add a _tag attribute to _Element. It is set to None at initialisation and to the tag value when the property is set. Getting the property then tests for None and returns the value if it was set before. Since we ensure at most one proxy element per node, this should not bring in any inconsistencies.

According to callgrind, the speedup is close to 95% for each subsequent call to element.tag after the first one - as long as the Python reference to the element persists. Obvious drawback: the first access is a little slower now, so accessing the tag names of tons of different objects will suffer. But that's somewhat acceptable - in that case, we really hit the Python string building overhead anyway.

It's both in the trunk and 0.9.x (ok, perhaps that isn't really the best name, agreed).

Stefan

participants (2)
- Martijn Faassen
- Stefan Behnel