[lxml-dev] Faster parsing!

16 Apr 2008

Hi,

here is a (pretty ugly, hackish) patch against libxml2 2.6.32 that replaces
the hash function of the internal hash table implementation with one that I
found on the web:

http://www.azillionmonkeys.com/qed/hash.html
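To illustrate why the choice of hash function matters for a hash table that stores tag names, here is a minimal sketch in Python. It is not the patch itself, and neither function is libxml2's hash or the one from the page linked above: `additive_hash` is a deliberately weak hash made up for this example, and `fnv1a_hash` is the well-known 32-bit FNV-1a hash, swapped in as a stand-in for "a hash that mixes its input well". The table size of 1024 buckets and the generated `tagN` names are assumptions, too.

```python
from collections import Counter


def additive_hash(name):
    """Deliberately weak hash (illustration only): sums the byte values."""
    return sum(name.encode("utf-8")) & 0xFFFFFFFF


def fnv1a_hash(name):
    """32-bit FNV-1a, a well-known hash that mixes every input byte."""
    value = 0x811C9DC5  # FNV offset basis
    for byte in name.encode("utf-8"):
        value ^= byte
        value = (value * 0x01000193) & 0xFFFFFFFF  # FNV prime, kept to 32 bits
    return value


def longest_chain(hash_func, names, buckets=1024):
    """Size of the fullest bucket when all names go into a fixed-size table."""
    counts = Counter(hash_func(name) % buckets for name in names)
    return max(counts.values())


# Generated tag names, similar in spirit to what a benchmark might produce.
names = ["tag%d" % i for i in range(10000)]

chains = {
    "additive": longest_chain(additive_hash, names),
    "fnv1a": longest_chain(fnv1a_hash, names),
}

if __name__ == "__main__":
    for label, length in chains.items():
        print("%-8s longest bucket chain: %d" % (label, length))
```

The weak hash maps many different names to the same few buckets, so lookups degrade into long linear scans. A better-mixing hash keeps the chains short, and that is the kind of effect a faster, better-distributed hash function is after.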

Remember that cElementTree is still the fastest XML tree parser for Python, by
a factor of up to 10 compared to lxml?

http://codespeak.net/lxml/performance.html#parsing-and-serialising

According to lxml's benchmark suite, this patch brings the libxml2/lxml
parser down to about twice (10x -> 2x!) the parsing time of cElementTree for
larger files (a few MB). I find this quite impressive. Here are the numbers
(lower is better):

lxe: XML         (SAXR T1)   39.4800 msec/pass   # pretty large tree
cET: XML         (SAXR T1)   20.0679 msec/pass

lxe: XML         (SAXR T3)   25.9020 msec/pass
cET: XML         (SAXR T3)   33.2189 msec/pass

lxe: XML         (SAXR T4)    0.7598 msec/pass
cET: XML         (SAXR T4)    0.7181 msec/pass
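For a quick reading of these numbers, the lxe/cET ratio per tree can be computed as in the following small sketch. The timings are copied from the table above; the dictionary layout and variable names are just for illustration.

```python
# msec/pass from the table above: (lxe, cET) per benchmark tree
timings = {
    "T1": (39.4800, 20.0679),
    "T3": (25.9020, 33.2189),
    "T4": (0.7598, 0.7181),
}

# A ratio above 1 means lxe is slower than cET, below 1 means it is faster.
ratios = {tree: lxe / cet for tree, (lxe, cet) in timings.items()}

if __name__ == "__main__":
    for tree, ratio in sorted(ratios.items()):
        print("%s: lxe takes %.2fx the time of cET" % (tree, ratio))
```

So lxe needs roughly twice the time of cET on the large T1 tree, is faster than cET on T3, and is about on par on T4.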

The benchmark is not a particularly good measure for this exact case, as
it generates the XML tag names instead of sticking to a (likely smaller) fixed
set of language tags. Still, this gives me a factor of 7 (!) in performance
improvement for in-memory parsing compared to an unpatched libxml2. I also
reran the Old Testament benchmark for a more realistic scenario. The
speedup there is up to 30%, not bad either. And lxml's "parse()+iter()"
implementation of that benchmark is now as fast as cET's "iterparse()" version. :)

I would love to get some feedback from others who want to test this. Just
patch your copy of libxml2 and let lxml run against it. I'm eager to hear some
numbers to convince Daniel to get a cleaned-up version of this patch into
mainline libxml2.
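For anyone who wants to collect comparable numbers without setting up the full benchmark suite, a rough timing harness could look like the sketch below. It is not lxml's benchmark code: the document shape, the `build_document()` and `msec_per_pass()` helpers, and the best-of-N timing are all assumptions made for this example. It times the standard library's `xml.etree.ElementTree` (which picks up its C accelerator automatically in current Python versions) and, if lxml is importable, `lxml.etree` as well.

```python
import time
import xml.etree.ElementTree as ET

try:
    from lxml import etree as lxml_etree  # optional, only timed if installed
except ImportError:
    lxml_etree = None


def build_document(children=1000, grandchildren=10):
    """Serialise a synthetic tree whose tag names are generated, not fixed."""
    root = ET.Element("root")
    for i in range(children):
        child = ET.SubElement(root, "child%d" % i)
        for j in range(grandchildren):
            ET.SubElement(child, "item%d" % j).text = "text %d" % j
    return ET.tostring(root)  # bytes


def msec_per_pass(parse, data, passes=5):
    """Best wall-clock time of several parse runs, in milliseconds."""
    best = float("inf")
    for _ in range(passes):
        start = time.perf_counter()
        parse(data)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0


def run(data):
    """Time each available parser on the same in-memory document."""
    results = {"stdlib ET": msec_per_pass(ET.fromstring, data)}
    if lxml_etree is not None:
        results["lxml"] = msec_per_pass(lxml_etree.fromstring, data)
    return results


if __name__ == "__main__":
    data = build_document()
    for name, msec in run(data).items():
        print("%-10s %10.4f msec/pass" % (name, msec))
```

Running it once with lxml built against an unpatched libxml2 and once against the patched one should show whether the change makes a difference on a given machine. Absolute numbers will of course differ from the ones quoted above.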

Hope you like it,

Stefan

Stefan Behnel