intolerant HTML parser

Mon Feb 8 04:16:34 EST 2010

Jim, 06.02.2010 20:09:
> I generate some HTML and I want to include in my unit tests a check
> for syntax.  So I am looking for a program that will complain at any
> syntax irregularities.

The first thing to note is that you should consider switching to an HTML
generation tool that produces well-formed markup automatically. Generating
markup by hand (e.g. by string concatenation) is usually not a good idea.

> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax.  I just tried feeding
> HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
> didn't complain.
> 
> That is, this:
>         h=HTMLParser.HTMLParser()
>         try:
>             h.feed('<p>a<b>b</p></b>')
>             h.close()
>             print "I expect not to see this line"
>         except Exception, err:
>             print "exception:",str(err)
> gives me "I expect not to see this line".
> 
> Am I using that routine incorrectly?  Is there a natural Python choice
> for this job?

You can use lxml and let it validate the HTML output against the HTML DTD.
Just load the DTD from a catalog using the DOCTYPE in the document (see the
'docinfo' property on the parse tree).

http://codespeak.net/lxml/validation.html#id1
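
A minimal sketch of the validation step, assuming lxml is installed. The tiny
inline DTD and the dtd_errors() helper are made up for illustration; a real
test would load the DTD named by the document's DOCTYPE instead.

```python
from io import StringIO
from lxml import etree

# Hypothetical mini-DTD standing in for a real (X)HTML DTD:
# a <p> may contain text and <b>, a <b> may contain only text.
TOY_DTD = StringIO("""
<!ELEMENT p (#PCDATA | b)*>
<!ELEMENT b (#PCDATA)>
""")
dtd = etree.DTD(TOY_DTD)


def dtd_errors(markup):
    """Return the DTD validation messages for markup; empty means valid."""
    root = etree.fromstring(markup)
    if dtd.validate(root):
        return []
    return [entry.message for entry in dtd.error_log.filter_from_errors()]


valid_errors = dtd_errors("<p>a<b>b</b></p>")    # conforms to the toy DTD
invalid_errors = dtd_errors("<p>a<i>b</i></p>")  # <i> is not declared
```

In a unit test you would assert that the returned list is empty. etree.DTD
also accepts an external_id argument for catalog lookups; whether that
resolves depends on the XML catalog set up on your machine.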

Note that when parsing the HTML file, you should disable the parser's failure
recovery (the 'recover' parser option) to make sure it barks on syntax errors
instead of silently fixing them up.

http://codespeak.net/lxml/parsing.html#parser-options
http://codespeak.net/lxml/parsing.html#parsing-html
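
A rough sketch of what that looks like, again assuming lxml is installed;
first_syntax_error() is a made-up helper name. The strict XML parser is
included because it is certain to reject the mis-nested sample (useful if your
output is XHTML). What the HTML parser reports with recover=False depends on
the underlying libxml2 version, so check it against your own setup.

```python
from lxml import etree

SAMPLE = "<p>a<b>b</p></b>"  # the mis-nested markup from the question


def first_syntax_error(markup, parser):
    """Return the parser's error message, or None if markup parsed cleanly."""
    try:
        etree.fromstring(markup, parser)
    except etree.XMLSyntaxError as err:
        return str(err)
    return None


# Strict XML parsing: mis-nested tags are a well-formedness error.
xml_result = first_syntax_error(SAMPLE, etree.XMLParser(recover=False))

# HTML parsing with recovery switched off (the default is recover=True).
html_result = first_syntax_error(SAMPLE, etree.HTMLParser(recover=False))

# Correctly nested markup passes the strict XML parser.
clean_result = first_syntax_error("<p>a<b>b</b></p>",
                                  etree.XMLParser(recover=False))
```

A unit test can then simply assert that the result for the generated page is
None.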

Stefan