Hi, Mike Meyer wrote:
Not necessarily. libxml2 uses the C library's malloc/free. Historically, on Unix systems the C library's free/malloc don't return memory to the OS, but keep it in an internal heap. Non-Unix systems tend to behave otherwise, which creates some confusion for people moving from those systems to Unix.
I tend to consider libc a part of the OS. But technically you are right, and it even makes a difference here.
I haven't looked hard (yet, at least). The example on the ElementSoup page parses *slightly* better with BS, but lxml parses it very similarly to how html5lib does, and since html5lib has the advantage of implementing something like a standard, I'd consider that the better benchmark.
If I had a good collection of crappy HTML, it would probably make an interesting test to see how differently html5lib, BS, and lxml parse it. I'm not sure where to find a good collection like that. Maybe html5lib's tests, I guess. There seem to be a fair number of HTML browser compliance test suites on the web, but at first glance I didn't find any test suites for broken HTML.
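For lack of a real corpus, one way to see what such a test would be poking at is the raw event stream a parser has to work with. A minimal sketch using only the stdlib's html.parser (so it doesn't compare BS/lxml/html5lib directly; the broken snippet is made up for illustration):

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Records the raw start-tag/end-tag/data events as the parser sees them."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

# Deliberately broken HTML: unclosed <b>, implicitly closed <p>,
# and a stray </b> that crosses the second paragraph.
broken = "<p><b>bold?<p>new para</b>"

logger = EventLogger()
logger.feed(broken)
logger.close()

for event in logger.events:
    print(event)
```

html.parser just reports the tokens as they appear; it never closes the first `<p>` and happily emits the mismatched `</b>`. It's exactly these cases where tree builders like BS, lxml, and html5lib have to pick a repair strategy, and where they tend to disagree.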
I think google has a nice collection of broken html :-).
Hmmm, do you want us to ask them? Or maybe ask their cache instead? I just don't know how to write a Google search query for broken HTML pages... :) Anyway, I'm not sure they actually keep the broken HTML pages around. I would expect them to send them through a sanitizer before doing anything else with them (including local caching).

Stefan