BeautifulSoup vs. real-world HTML comments
paul at boddie.org.uk
Wed Apr 4 22:46:47 CEST 2007
John Nagle wrote:
> The syntax that browsers understand as HTML comments is much less
> restrictive than what BeautifulSoup understands. I keep running into
> sites with formally incorrect HTML comments which are parsed happily
> by browsers. Here's yet another example, this one from
> "http://www.webdirectory.com". The page starts like this:
> <!Hello there! Welcome to The Environment Directory!>
> <!Not too much exciting HTML code here but it does the job! >
> <!See ya, - JD >
Anything based on libxml2 and its HTML parser will handle such broken
HTML just fine, even if it just ignores such erroneous attempts at
comments, discarding them as the plain nonsense they clearly are.
Certainly, libxml2dom seems to deal with the page:
d = libxml2dom.parseURI("http://www.webdirectory.com", html=1,
I guess lxml and the original libxml2 bindings work at least as well.
Note that some browsers won't be as happy if you give them such
content as XHTML.
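
For anyone who wants to see how a lenient parser treats the "<!...>"
lines quoted above, here is a minimal sketch. It uses the standard
library's html.parser as a stand-in, not libxml2 or libxml2dom, so it
says nothing about what those do internally. The class name
BogusCommentCollector and the markup after the three quoted lines are
made up for the example.

```python
from html.parser import HTMLParser

# The three malformed "comments" quoted above, followed by a little
# ordinary markup (the markup is invented for this example).
PAGE = """<!Hello there! Welcome to The Environment Directory!>
<!Not too much exciting HTML code here but it does the job! >
<!See ya, - JD >
<html><head><title>Example</title></head><body></body></html>"""


class BogusCommentCollector(HTMLParser):
    """Hypothetical helper: records comments and start tags as seen."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.tags = []

    def handle_comment(self, data):
        # html.parser reports "<!...>" constructs that are neither
        # "<!--" comments nor a doctype through this same callback.
        self.comments.append(data)

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = BogusCommentCollector()
collector.feed(PAGE)
collector.close()

print(collector.comments)
print(collector.tags)
```

With this parser the three lines end up in collector.comments and the
real markup after them is still seen, which is roughly the tolerant
behaviour browsers show for such pages.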