For the curious, I've attached some benchmarks. These are preliminary; I'm putting together the numbers for my HTML talk at PyCon.

One thing I'd like to test is the memory use for documents. To do this I'm parsing about 4.5Mb of documents, keeping them all in memory, and looking at the VSZ/RSS sizes reported by ps before and after. I don't think this is the right/best way to do it. For instance, transient memory use by some parsers makes Python grab a bunch of memory, but that memory might be freed after parsing and usable for other things. Also, I don't know if VSZ/RSS is valid at all; I get the impression it isn't that valid. And the increases I'm seeing for lxml don't seem sufficient: the process should grow by at least 4.5Mb, right? lxml can't be that much more efficient than the serialized form of these files. Another clear indication that we're measuring transient stuff is that when using the BeautifulSoup or html5 parser with an lxml document, the memory increases substantially.

So any ideas on how to test memory would be much appreciated. (Maybe I could look at ps, then start creating Python objects until the memory use increases, so that I know I've used up any extra allocated memory?)

I've also attached the script, though you'll need to grab your own HTML files. html_lxml is broken; I patched it locally to work (http://code.google.com/p/html5lib/issues/detail?id=65).
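One alternative to scraping ps output, as a rough sketch: the stdlib resource module reports the process's peak RSS directly, which also partly sidesteps the transient-allocation problem since ru_maxrss is a high-water mark rather than a snapshot. The parse step below is a stand-in, not the actual tester.py code, and note ru_maxrss's units differ by platform (kilobytes on Linux, bytes on Mac):

```python
import resource
import sys

def peak_rss_kb():
    """Peak resident set size of this process, normalized to Kb.

    ru_maxrss is reported in kilobytes on Linux but in bytes on
    Mac OS X, so normalize (assuming one of those two platforms).
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss //= 1024
    return rss

before = peak_rss_kb()

# Stand-in for the real work: parse the documents and keep the
# resulting trees alive so they count against resident memory.
docs = ["<html><body>%d</body></html>" % i for i in range(1000)]

after = peak_rss_kb()
print("peak RSS grew by %d Kb" % (after - before))
```

This still won't distinguish memory the allocator is holding onto from memory the parsed trees actually need, but it at least avoids forking ps and parsing its output.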
  Ian

Parsing 355 files, 4524Kb (ripped from python.org)

lxml          = lxml.html
bs            = BeautifulSoup
html5_cet     = html5 parser with cElementTree model
html5_et      = html5 parser with ElementTree model
html5_lxml    = html5 parser with lxml.html model
html5_minidom = html5 parser with minidom model
html5_simple  = html5 parser with internal simple_tree model
lxml_bs       = BeautifulSoup parser with lxml model
htmlparser    = HTMLParser, with no parser actions; the document string is its own model

python tester.py --no-gc

lxml          :  0.5156 sec ( 100% of lxml)
bs            : 10.3816 sec (2013% of lxml)
html5_cet     : 29.5829 sec (5737% of lxml)
html5_et      : 30.2433 sec (5865% of lxml)
html5_lxml    : 31.7533 sec (6158% of lxml)
html5_minidom : 34.2963 sec (6651% of lxml)
html5_simple  : 28.7421 sec (5574% of lxml)
lxml_bs       : 12.2269 sec (2371% of lxml)
htmlparser    :  3.0968 sec ( 600% of lxml)

python tester.py --no-gc --serialize

lxml          : 0.2704 sec ( 100% of lxml)
bs            : 1.8265 sec ( 675% of lxml)
html5_cet     : 1.5960 sec ( 590% of lxml)
html5_et      : 1.7677 sec ( 653% of lxml)
html5_lxml    : 0.2755 sec ( 101% of lxml)
html5_minidom : 3.4696 sec (1283% of lxml)
html5_simple  : 1.4929 sec ( 552% of lxml)
lxml_bs       : 0.2834 sec ( 104% of lxml)

VSZ/RSS increase:

lxml          :   1168 /    120
bs            :  82508 /  82176
html5_cet    :  54620 /  54756
html5_et     :  64688 /  64960
html5_lxml   :  49076 /  49124
html5_minidom: 194304 / 192928
html5_simple :  98608 /  98004
lxml_bs      : 104920 / 104852
htmlparser   :   5412 /   4456

Note: htmlparser keeps all the strings of the documents in memory.