intolerant HTML parser
Stefan Behnel
stefan_ml at behnel.de
Mon Feb 8 04:16:34 EST 2010
Jim, 06.02.2010 20:09:
> I generate some HTML and I want to include in my unit tests a check
> for syntax. So I am looking for a program that will complain at any
> syntax irregularities.
The first thing to note is that you should consider switching to an HTML
generation tool that guarantees well-formed output automatically. Generating
markup by hand is usually not a good idea.
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax. I just tried feeding
> HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
> didn't complain.
>
> That is, this:
> h=HTMLParser.HTMLParser()
> try:
>     h.feed('<p>a<b>b</p></b>')
>     h.close()
>     print "I expect not to see this line"
> except Exception, err:
>     print "exception:",str(err)
> gives me "I expect not to see this line".
>
> Am I using that routine incorrectly? Is there a natural Python choice
> for this job?
You can use lxml and let it validate the HTML output against the HTML DTD.
Just load the DTD from a catalog using the DOCTYPE in the document (see the
'docinfo' property on the parse tree).
http://codespeak.net/lxml/validation.html#id1
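As a rough sketch of what the validation step looks like, the following uses
a deliberately tiny, made-up DTD (MINI_DTD, covering only 'p' and 'b') in place
of the real HTML DTD, so that it runs without a catalog. The helper name
is_valid is hypothetical, and the input is parsed with the strict XML parser
here. In a real setup you would load the DTD named by the document's DOCTYPE
instead.

```python
from io import StringIO
from lxml import etree

# Hypothetical miniature DTD, standing in for the real (X)HTML DTD.
MINI_DTD = """
<!ELEMENT p (#PCDATA | b)*>
<!ELEMENT b (#PCDATA)>
"""

def is_valid(xml_text, dtd_text=MINI_DTD):
    """Return True if xml_text validates against the given DTD text."""
    dtd = etree.DTD(StringIO(dtd_text))
    root = etree.XML(xml_text)  # strict XML parse; raises on malformed input
    valid = dtd.validate(root)
    if not valid:
        # The DTD object keeps a log of what went wrong.
        for entry in dtd.error_log:
            print(entry.message)
    return valid

if __name__ == "__main__":
    print(is_valid("<p>a<b>b</b></p>"))   # only declared elements
    print(is_valid("<p>a<i>x</i></p>"))   # 'i' is not declared in MINI_DTD
```

In a unit test you would typically assert on the return value and include the
logged messages in the failure output.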
Note that when parsing the HTML file, you should disable the parser's error
recovery (the 'recover' option) to make sure it barks on syntax errors instead
of fixing them up.
http://codespeak.net/lxml/parsing.html#parser-options
http://codespeak.net/lxml/parsing.html#parsing-html
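A minimal sketch of the strict parsing setup, assuming lxml is installed. The
helper name strict_parse_error and the two sample documents are made up for
illustration. Exactly which irregularities get reported depends on the
underlying libxml2 version, so treat the mismatched-tag case as something to
try out rather than a guarantee.

```python
from lxml import etree

def strict_parse_error(html_text):
    """Return None if html_text parses without complaint, else the error text."""
    # recover=False tells the HTML parser not to repair broken markup silently.
    parser = etree.HTMLParser(recover=False)
    try:
        etree.fromstring(html_text, parser)
    except etree.XMLSyntaxError as err:
        return str(err)
    return None

# Hypothetical sample documents: one properly nested, one with crossed tags.
GOOD = "<html><head><title>t</title></head><body><p>a<b>b</b></p></body></html>"
BAD = "<html><head><title>t</title></head><body><p>a<b>b</p></b></body></html>"

if __name__ == "__main__":
    for label, doc in (("good", GOOD), ("bad", BAD)):
        print(label, "->", strict_parse_error(doc))
```

In a unit test, you would assert that strict_parse_error() returns None for
the generated page.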
Stefan