Hi, Mike Meyer wrote:
Not necessarily. libxml2 uses the C library's malloc/free. Historically, on Unix systems the C library's free/malloc don't return memory to the OS, but keep it in an internal heap. Non-Unix systems tend to behave otherwise, which creates some confusion for people moving from those systems to Unix.
I tend to consider libc a part of the OS. But technically you are right, and it even makes a difference here.
I haven't looked hard (yet, at least). The example on the ElementSoup page parses *slightly* better with BS, but lxml parses it very similarly to how html5lib does, and since html5lib has the advantage of implementing something like a standard, I'd consider that the better benchmark.
If I had a good collection of crappy HTML, it would probably make an interesting test to see how differently html5lib, BS, and lxml parse it. I'm not sure where to find a good collection like that. Maybe html5lib's tests, I guess. There seem to be a fair number of HTML browser compliance test suites on the web, but at first glance I didn't find any test suites for broken HTML.
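For lack of a real corpus, one way to see what such a test would be poking at is the raw event stream a parser has to work with. A minimal sketch using only the stdlib's html.parser (so it doesn't compare BS/lxml/html5lib directly; the broken snippet is made up for illustration):

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Records the raw start-tag/end-tag/data events as the parser sees them."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

# Deliberately broken HTML: unclosed <b>, implicitly closed <p>,
# and a stray </b> that crosses the second paragraph.
broken = "<p><b>bold?<p>new para</b>"

logger = EventLogger()
logger.feed(broken)
logger.close()

for event in logger.events:
    print(event)
```

html.parser just reports the tokens as they appear; it never closes the first `<p>` and happily emits the mismatched `</b>`. It's exactly these cases where tree builders like BS, lxml, and html5lib have to pick a repair strategy, and where they tend to disagree.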
I think google has a nice collection of broken html :-).
Hmmm, do you want us to ask them? Or maybe ask their cache instead? I just don't know how to write a Google search query for broken HTML pages... :) Anyway, I'm not sure they actually keep the broken HTML pages around. I would expect them to send them through a sanitizer before doing anything else with them (including local caching).

Stefan