[lxml-dev] Faster parsing!
Hi,

here is a (pretty ugly, hackish) patch against libxml2 2.6.32 that replaces the hash function of the internal hash table implementation with one that I found on the web:

http://www.azillionmonkeys.com/qed/hash.html

Remember that cElementTree is still the fastest XML tree parser for Python? By a factor of up to 10 compared to lxml?

http://codespeak.net/lxml/performance.html#parsing-and-serialising

According to lxml's benchmark suite, this patch brings the parsing time of libxml2/lxml down to 2 times that of cElementTree (10x -> 2x!) for larger files (some MB). I find this quite impressive. Here are the numbers (lower is better):

lxe: XML (SAXR T1)   39.4800 msec/pass   # pretty large tree
cET: XML (SAXR T1)   20.0679 msec/pass
lxe: XML (SAXR T3)   25.9020 msec/pass
cET: XML (SAXR T3)   33.2189 msec/pass
lxe: XML (SAXR T4)    0.7598 msec/pass
cET: XML (SAXR T4)    0.7181 msec/pass

The benchmark is not a particularly good measure for this exact case, as it generates the XML tag names instead of sticking to a (likely smaller) fixed set of language tags. Still, this gives me a factor of 7 (!) in performance improvement for in-memory parsing compared to an unpatched libxml2.

I also reran the Old Testament benchmark for a more realistic scenario. The speedup there is up to 30%, not that bad either. And lxml's "parse()+iter()" implementation of that benchmark is now as fast as cET's "iterparse()" version. :)

I would love to get some feedback from others who want to test this. Just patch your copy of libxml2 and let lxml run against it. I'm eager to hear some numbers to convince Daniel to get a cleaned up version of this patch into mainstream libxml2.

Hope you like it,
Stefan
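For anyone who wants to collect comparable numbers, here is a minimal timing sketch. It is not lxml's benchmark suite: the helpers `make_document`, `time_parse` and `run` are made up for illustration, and the standard library's `xml.etree.ElementTree` stands in for cElementTree. lxml is imported only if it is installed, and it measures whatever libxml2 your lxml build links against, so run it once against the unpatched and once against the patched library.

```python
import io
import time
import xml.etree.ElementTree as ET  # stdlib stand-in for cElementTree

try:
    from lxml import etree as lxml_etree  # optional; uses the libxml2 that lxml was built against
except ImportError:
    lxml_etree = None


def make_document(n_elements, n_distinct_tags):
    """Build an XML byte string whose children cycle through n_distinct_tags names."""
    parts = ["<root>"]
    for i in range(n_elements):
        tag = "tag%d" % (i % n_distinct_tags)
        parts.append("<%s attr%d='x'>text</%s>" % (tag, i % n_distinct_tags, tag))
    parts.append("</root>")
    return "".join(parts).encode("utf-8")


def time_parse(parse, data, passes=5):
    """Return the best wall-clock time in msec for one in-memory parse of data."""
    best = float("inf")
    for _ in range(passes):
        start = time.perf_counter()
        parse(io.BytesIO(data))
        best = min(best, time.perf_counter() - start)
    return best * 1000.0


def run(n_elements=20000):
    """Time a fixed small vocabulary against generated (all distinct) tag names."""
    results = {}
    for label, n_tags in (("small vocabulary", 10), ("generated names", n_elements)):
        data = make_document(n_elements, n_tags)
        results[("ET", label)] = time_parse(ET.parse, data)
        if lxml_etree is not None:
            results[("lxml", label)] = time_parse(lxml_etree.parse, data)
    return results


if __name__ == "__main__":
    for (lib, label), msec in sorted(run().items()):
        print("%-4s %-16s %8.3f msec/pass" % (lib, label, msec))
```

The "generated names" case mimics the benchmark's habit of inventing tag names, while the "small vocabulary" case is closer to parsing a single fixed language. Absolute numbers depend on the machine, so only the ratios between runs are worth comparing.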
Hi,

Stefan Behnel wrote:
> here is a patch against libxml2 2.6.32 that replaces the hash function of the
> internal hash table implementation by one that I found on the web

a cleaned up version of this patch will be integrated into libxml2 2.6.33. It won't make a difference for those who parse 'only' HTML or other single languages with a somewhat small vocabulary (tags/attributes), but if you parse many different types of XML documents (XSD, XSLT, your language, ...), you will notice a difference.

Stefan
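Why the vocabulary size matters can be illustrated with a toy experiment. This is only a sketch under loud assumptions: `additive_hash` is a deliberately weak hash invented for the illustration and is not libxml2's actual function, 32-bit FNV-1a is a well-known stand-in for "a better-mixing hash" and is not the function from the patch, and the table size and name lists are made up.

```python
from collections import Counter


def additive_hash(name, table_size):
    """Deliberately weak toy hash: sum of the byte values (NOT libxml2's function)."""
    return sum(name.encode("utf-8")) % table_size


def fnv1a_hash(name, table_size):
    """32-bit FNV-1a, used here only as a stand-in for a better-mixing hash."""
    value = 0x811C9DC5  # FNV offset basis
    for byte in name.encode("utf-8"):
        value ^= byte
        value = (value * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return value % table_size


def longest_chain(hash_func, names, table_size=256):
    """Return the size of the fullest bucket when names are hashed into table_size buckets."""
    buckets = Counter(hash_func(name, table_size) for name in names)
    return max(buckets.values())


# Hypothetical vocabularies: a handful of fixed tags vs. many distinct names.
small_vocabulary = ["html", "head", "body", "div", "span", "p", "a", "img"]
large_vocabulary = ["ns%d:element%d" % (i % 7, i) for i in range(2000)]

for label, names in (("small", small_vocabulary), ("large", large_vocabulary)):
    print(label, longest_chain(additive_hash, names), longest_chain(fnv1a_hash, names))
```

With only a few names, both functions leave the buckets nearly empty, so lookups are cheap either way. With thousands of similar-looking names, the weak hash piles many of them into the same buckets, and every lookup has to walk a long chain. That is the situation in which a better hash function pays off.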
On Tuesday 2008-04-22, Stefan Behnel wrote:
> Hi,
>
> Stefan Behnel wrote:
> > here is a patch against libxml2 2.6.32 that replaces the hash function of the
> > internal hash table implementation by one that I found on the web
>
> a cleaned up version of this patch will be integrated into libxml2 2.6.33. It
> won't make a difference for those who parse 'only' HTML or other single
> languages with a somewhat small vocabulary (tags/attributes), but if you parse
> many different types of XML documents (XSD, XSLT, your language, ...), you
> will notice a difference.
>
> Stefan

Well done, Stefan. I'm not sure if this patch will help me specifically, but I really appreciate the work you put into lxml. I'm really glad I ported our code in this direction :-)

Keep well
Friedel

participants (2)
- F Wolff
- Stefan Behnel