Stefan Behnel wrote:
I noticed that you calculate the initial size /after/ parsing in the --serialize case. If I move it before that, I get reasonable numbers for lxml: +17M for 2.5MB of documents on a 32bit machine.
I didn't intend to include the --serialize option, but I must have done so. Though I don't know why they weren't *all* messed up then? Anyway, I get 25MB, which seems quite reasonable. Here are the revised numbers:

                         VSZ /    RSS
      lxml          :  25908 /  26232
      bs            :  82508 /  82168
      html5_cet     :  54616 /  54760
      html5_et      :  64688 /  64964
      html5_lxml    :  49056 /  49124
      html5_minidom : 194352 / 192936
      html5_simple  :  99772 /  98016
      lxml_bs       : 104916 / 104856
      htmlparser    :   4440 /   4448

I also tried allocating random strings until the size increased, to see if there was lots of allocated-but-free memory (the unused amount is an estimate, as I'm unsure of the exact internal representation of a list of strings; a rough sketch of the probe is below). The results were peculiar:

                         VSZ /    RSS
      lxml          :  26952 /  26211  (unused: 5)
      bs            :  83408 /  82156  (unused: 0)
      html5_cet     :  55640 /  54745  (unused: 19)
      html5_et      :  65712 /  64946  (unused: 14)
      html5_lxml    :  50072 /  48986  (unused: 134)
      html5_minidom : 195372 / 192914  (unused: 14)
      html5_simple  :  99772 /  97999  (unused: 17)
      lxml_bs       : 104644 /  73037  (unused: 31783)
      htmlparser    :   4448 /   4433  (unused: 19)

I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots of free memory left over at the end. I am surprised that the others don't; at the least, html5_lxml should be similar, I'd think (though if you take the unused memory into account, html5_lxml and lxml_bs are similar).

I don't actually know if BS is better than lxml at parsing... anything. I haven't looked hard (yet, at least). The example on the ElementSoup page parses *slightly* better with BS, but lxml parses it very similarly to how html5lib parses it, which I'd consider the better standard; html5lib has the advantage of being a kind of standard. If I had a good collection of crappy HTML, it would probably be an interesting test to see how differently html5lib, BS, and lxml parse it. I'm not sure where to find a good collection like that. Maybe html5lib's tests, I guess.
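A rough sketch of that kind of probe (not the exact script; it reads /proc so it is Linux-only, and the per-string overhead used to turn the count into kilobytes is just a guess):

    import random
    import string

    def rss_kb():
        """Resident set size of the current process in KiB (from /proc)."""
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])

    def estimate_unused_kb(strlen=100, overhead=50):
        # Keep allocating distinct random strings until the RSS grows;
        # whatever fit before the growth approximates memory Python had
        # already allocated but was not using.
        baseline = rss_kb()
        filler = []
        while rss_kb() == baseline:
            filler.append(''.join(random.choice(string.ascii_letters)
                                  for _ in range(strlen)))
        # payload plus a guessed per-object / list-slot overhead, in KiB
        return len(filler) * (strlen + overhead) // 1024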
Another clear indication that we're measuring transient stuff is that memory increases substantially when using the BeautifulSoup or html5 parser to build an lxml document. So any ideas on how to test memory would be much appreciated.
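For context, a rough sketch of measuring the three lxml-tree-producing paths side by side; 'page.html' is just a stand-in input, and the parser entry points (html5lib.parse with the lxml tree builder, lxml.html.soupparser) are written from memory and may not match the versions used for the numbers above:

    import lxml.html

    def rss_kb():                                    # Linux-only, as above
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])

    def growth(label, parse, path):
        # Report how much the resident size grew while building the tree.
        before = rss_kb()
        tree = parse(path)                           # keep the tree alive
        print('%-12s +%d KiB RSS' % (label, rss_kb() - before))
        return tree

    trees = []

    # lxml's own HTML parser
    trees.append(growth('lxml', lxml.html.parse, 'page.html'))

    # html5lib building an lxml tree; treebuilder='lxml' is the modern
    # spelling and may differ from the html5lib API available back then.
    import html5lib
    trees.append(growth('html5_lxml',
                        lambda p: html5lib.parse(open(p, 'rb'),
                                                 treebuilder='lxml'),
                        'page.html'))

    # BeautifulSoup parsed first, then converted to an lxml tree
    # (lxml.html.ElementSoup / lxml.html.soupparser, depending on version).
    from lxml.html import soupparser
    trees.append(growth('lxml_bs',
                        lambda p: soupparser.parse(open(p)),
                        'page.html'))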
Somewhat hard to do across libraries. For example, the way the ElementSoup parser (i.e. BS on lxml) works is: parse the document with BS, and then recursively translate the tree into an lxml tree. So you temporarily use about twice the memory. You'd have to intercept the tree builder process at the end (before releasing the BS tree) and measure there in order to get the maximum amount of memory used. I'd run it a couple of times and just watch top while it's running. That way, you can figure out something close to the maximum yourself.
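A rough way to automate the "watch top" approach (Linux-only, polling /proc; spikes shorter than the polling interval can still be missed):

    import threading
    import time

    def rss_kb():                                    # as in the earlier sketch
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])

    def peak_rss_during(work, interval=0.05):
        """Run work() and return (result, peak RSS in KiB seen while it ran)."""
        peak = [rss_kb()]
        done = threading.Event()

        def poll():
            # Sample the RSS in the background and keep the maximum.
            while not done.is_set():
                peak[0] = max(peak[0], rss_kb())
                time.sleep(interval)

        poller = threading.Thread(target=poll)
        poller.start()
        try:
            result = work()
        finally:
            done.set()
            poller.join()
        return result, peak[0]

    # e.g. (entry point hedged, 'page.html' is a stand-in):
    #   from lxml.html import soupparser
    #   tree, peak = peak_rss_during(lambda: soupparser.parse(open('page.html')))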
I'm pretty sure what you end up with afterwards is the maximum use, as Python doesn't release memory back to the operating system after it has allocated it. (Or at least Python 2.4 doesn't.) So instead you have a pool of memory that Python isn't using, but the OS doesn't know that. I guess the assumption is that if Python never needs to use it again, at least the OS can move it to virtual memory.
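A quick experiment that shows the effect (treat it as a sketch; behaviour depends on the Python version and allocator, and newer CPython releases do hand some memory back to the OS):

    def vsz_rss_kb():
        # Current VSZ and RSS in KiB, read from /proc (Linux-only).
        vsz = rss = None
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmSize:'):
                    vsz = int(line.split()[1])
                elif line.startswith('VmRSS:'):
                    rss = int(line.split()[1])
        return vsz, rss

    def report(label):
        vsz, rss = vsz_rss_kb()
        print('%-10s VSZ=%d KiB  RSS=%d KiB' % (label, vsz, rss))

    report('baseline')
    data = [str(i) * 20 for i in range(500000)]   # ~500k distinct small strings
    report('allocated')
    del data
    report('after del')   # often barely lower: freed memory stays in Python's pools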
On the other hand, I don't know if temporary memory is of that much value for a comparison. If it takes more space while parsing - so what? You'll likely keep the document tree in memory much longer than the parsing takes, so that's the dominating factor.
Right, I'm more interested in the memory the finished document takes. Intermediate memory use shows up in the performance numbers anyway. Though I don't know if all that memory use might also lead to fragmentation, slowing down later allocations? This is beyond my understanding of Python performance.

Ian