Re: [lxml-dev] Some benchmarks

10 Mar 2008


      On Mon, 10 Mar 2008 21:15:58 +0100 Stefan Behnel <stefan_ml@behnel.de> wrote:
...
...
I also tried allocating random strings until the size increased, to see
if there was lots of allocated but free memory (the unused amount is an
estimate, as I'm unsure what the exact internal representation of a list
of strings is).  The results were peculiar:
                    VSZ       RSS (used)
lxml           :  26952  /  26211   (unused:     5)
bs             :  83408  /  82156   (unused:     0)
html5_cet      :  55640  /  54745   (unused:    19)
html5_et       :  65712  /  64946   (unused:    14)
html5_lxml     :  50072  /  48986   (unused:   134)
html5_minidom  : 195372  / 192914   (unused:    14)
html5_simple   :  99772  /  97999   (unused:    17)
lxml_bs        : 104644  /  73037   (unused: 31783)
htmlparser     :   4448  /   4433   (unused:    19)
I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots
of free memory left over at the end.  I am surprised that the others
don't; at least html5_lxml should be similar, I'd think (though I guess
if you take into account the unused memory then html5_lxml and lxml_bs
are similar).
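
For anyone wanting to repeat the "allocate random strings until the size
increases" probe described above, here is a minimal sketch. It is not the
script that produced the numbers in this thread. The names read_vsz_kb and
estimate_unused_kb are made up for illustration, and it assumes a Linux-style
/proc/self/status. The result is only a rough estimate, since the list and
string objects carry overhead that is not counted.

```python
import random
import string


def read_vsz_kb():
    """Return this process's virtual size (VSZ) in KB, or None if unavailable.

    Assumption: a Linux-style /proc/self/status with a "VmSize:" line.
    """
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmSize:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def estimate_unused_kb(chunk_len=1024, max_chunks=100_000, read_size=read_vsz_kb):
    """Allocate random strings until the process size grows.

    Returns the approximate string payload (in KB) that fit before the size
    increased, or None if the size cannot be read. If max_chunks is reached
    without growth, the value is only a lower bound.
    """
    start = read_size()
    if start is None:
        return None
    held = []  # keep references so nothing is freed while probing
    for _ in range(max_chunks):
        held.append("".join(random.choices(string.ascii_letters, k=chunk_len)))
        current = read_size()
        if current is None or current > start:
            break
    return len(held) * chunk_len // 1024
```

Calling estimate_unused_kb() right after a parse finishes gives a number
comparable in spirit to the "unused" column. The read_size parameter exists
only so a different size source (RSS, or a fake one for testing) can be
swapped in.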
That's a somewhat unfair comparison though. lxml (read: libxml2) doesn't use
Python's memory management, so memory that is freed by the parser is really
freed to the OS, not just left as a growing interpreter heap.
Not necessarily. libxml2 uses the C library's
free/malloc. Historically, on Unix systems the C library's free/malloc
don't return the memory to the OS, but keep it in an internal
heap. Systems that are Not Unix tend to do otherwise, creating some
confusion for people moving from those systems to Unix.
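
One way to check what a given system does is to compare the resident size
before an allocation, while it is held, and after it is released. The sketch
below is only an illustration: read_rss_kb and measure_release are
hypothetical helper names, it assumes a Linux-style /proc/self/status, and
Python's own object allocator sits on top of malloc, so the numbers only hint
at what the C library is doing.

```python
import gc


def read_rss_kb():
    """Return this process's resident set size in KB, or None if unavailable.

    Assumption: a Linux-style /proc/self/status with a "VmRSS:" line.
    """
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def measure_release(allocate, read_rss=read_rss_kb):
    """Return (before, peak, after) size readings around one allocation.

    `allocate` is any callable that builds and returns a large object.
    """
    before = read_rss()
    data = allocate()
    peak = read_rss()
    del data       # drop the only reference to the allocated object
    gc.collect()   # also clear anything kept alive by reference cycles
    after = read_rss()
    return before, peak, after


if __name__ == "__main__":
    # Many small strings, so the memory comes from the ordinary heap.
    readings = measure_release(lambda: [str(i) * 10 for i in range(200_000)])
    print("before / peak / after (KB):", readings)
```

If "after" stays close to "peak", the freed memory was kept in the process
heap; if it drops back toward "before", it was handed back to the OS.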
...
...
I haven't looked hard (yet, at least).  The example on the ElementSoup
page parses *slightly* better with BS, but lxml parses it very similarly
to how html5lib parses it, which I'd consider the better standard.
html5lib has the advantage of being a kind of standard.
If I had a good collection of crappy HTML, that would probably be an
interesting test to see how differently html5lib, BS, and lxml parse it.
I'm not sure where to find a good collection like that.  Maybe
html5lib's tests, I guess.
There seem to be a fair number of HTML browser compliance test suites on the
web, but I didn't find any test suites for broken HTML at first glance.
I think Google has a nice collection of broken HTML :-).

  <mike
-- 
Mike Meyer <mwm@mired.org>		http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.