On Mon, 10 Mar 2008 21:15:58 +0100 Stefan Behnel <stefan_ml@behnel.de> wrote:
I also tried allocating random strings until the size increased, to see if there was lots of allocated but free memory (the unused amount is an estimate, as I'm unsure what the exact internal representation of a list of strings is). The results were peculiar:
                        VSZ /    RSS (used)
    lxml          :   26952 /  26211 (unused: 5)
    bs            :   83408 /  82156 (unused: 0)
    html5_cet     :   55640 /  54745 (unused: 19)
    html5_et      :   65712 /  64946 (unused: 14)
    html5_lxml    :   50072 /  48986 (unused: 134)
    html5_minidom :  195372 / 192914 (unused: 14)
    html5_simple  :   99772 /  97999 (unused: 17)
    lxml_bs       :  104644 /  73037 (unused: 31783)
    htmlparser    :    4448 /   4433 (unused: 19)
I guess I'm not surprised that lxml_bs (lxml.html.ElementSoup) has lots of free memory left over at the end. I am surprised that the others don't, at least html5_lxml should be similar I'd think (though I guess if you take into account the unused memory then html5_lxml and lxml_bs are similar).
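For anyone wanting to repeat the measurement, the "allocate random strings until the size increased" probe described above might look roughly like the sketch below. This is not the script used for the numbers in this thread; the function names (`parse_status_kb`, `read_vsz_kb`, `estimate_unused_kb`) are made up for illustration, it assumes a Linux `/proc/self/status`, and it counts only string payload, so the result is a rough lower bound on the free space in the heap.

```python
import random
import string


def parse_status_kb(status_text, field):
    """Return the kB value of a /proc/<pid>/status field such as 'VmSize' or 'VmRSS'."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            # Lines look like "VmSize:     26952 kB"
            return int(line.split()[1])
    raise KeyError(field)


def read_vsz_kb():
    # Assumption: Linux, where /proc/self/status reports VmSize (VSZ) in kB.
    with open("/proc/self/status") as f:
        return parse_status_kb(f.read(), "VmSize")


def estimate_unused_kb(read_vsz=read_vsz_kb, chunk_len=1024, max_chunks=1_000_000):
    """Allocate random strings until VSZ grows past its starting value.

    Returns the approximate number of kB of string payload that fit
    before the process had to grow. Per-object overhead is ignored.
    """
    baseline = read_vsz()
    hoard = []  # keep references so nothing is freed while probing
    for _ in range(max_chunks):
        if read_vsz() > baseline:
            break
        hoard.append("".join(random.choices(string.ascii_letters, k=chunk_len)))
    return len(hoard) * chunk_len // 1024
```

Calling `estimate_unused_kb()` right after a parse finishes gives a number comparable in spirit to the "unused" column. The `read_vsz` parameter is only there so the probe can be pointed at a different counter, or at a fake one.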
That's a somewhat unfair comparison though. lxml (read: libxml2) doesn't use Python's memory management, so memory that is freed by the parser is really freed to the OS, not just left as a growing interpreter heap.
Not necessarily. libxml2 uses the C library's free/malloc. Historically, on Unix systems the C library's free/malloc doesn't return memory to the OS, but keeps it in an internal heap. Systems that are not Unix tend to do otherwise, which creates some confusion for people moving from those systems to Unix.
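A quick way to see what the local C library does is to malloc and free a pile of small blocks and watch RSS. The sketch below is only an illustration under some assumptions: Linux (`/proc/self/status`), a libc reachable through `ctypes`, and hypothetical helper names (`rss_kb`, `malloc_free_rss`). It keeps the last block allocated while freeing the rest, since a live block near the top of the heap is the usual reason a heap can't shrink. What the numbers look like depends on the allocator and its tuning, so no output is shown.

```python
import ctypes
import ctypes.util


def rss_kb():
    # Assumption: Linux, where /proc/self/status reports VmRSS in kB.
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found")


def malloc_free_rss(n_blocks=20_000, block_size=1024):
    """Return RSS in kB (before, after malloc, after free) using libc directly."""
    # find_library may return None, in which case CDLL(None) opens the
    # running program itself, which already has libc loaded.
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.malloc.restype = ctypes.c_void_p
    libc.malloc.argtypes = [ctypes.c_size_t]
    libc.free.restype = None
    libc.free.argtypes = [ctypes.c_void_p]

    before = rss_kb()
    blocks = [libc.malloc(block_size) for _ in range(n_blocks)]
    for p in blocks:
        # Touch the memory so the pages are actually resident.
        ctypes.memset(p, 0x55, block_size)
    after_malloc = rss_kb()

    pinned = blocks.pop()  # keep the most recent block live for now
    for p in blocks:
        libc.free(p)
    after_free = rss_kb()

    libc.free(pinned)
    return before, after_malloc, after_free
```

If `after_free` stays close to `after_malloc`, the freed blocks are sitting in the C library's heap rather than going back to the OS.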
I haven't looked hard (yet, at least). The example on the ElementSoup page parses *slightly* better with BS, but lxml parses it very similarly to how html5lib parses it, which I'd consider the better standard. html5lib has the advantage of being a kind of standard.
If I had a good collection of crappy HTML, that would probably be an interesting test to see how differently html5lib, BS, and lxml parse it. I'm not sure where to find a good collection like that. Maybe html5lib's tests, I guess.
There seem to be a fair number of HTML browser compliance test suites on the web, but I didn't find any test suites for broken HTML at first glance.
I think Google has a nice collection of broken HTML :-).

<mike
--
Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.