intolerant HTML parser
nobody at nowhere.com
Sun Feb 7 05:33:12 CET 2010
On Sat, 06 Feb 2010 11:09:31 -0800, Jim wrote:
> I generate some HTML and I want to include in my unit tests a check
> for syntax. So I am looking for a program that will complain at any
> syntax irregularities.
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax. I just tried feeding
> HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
> didn't complain.
HTMLParser is a tokeniser, not a parser. It treats the data as a
stream of tokens (tags, entities, PCDATA, etc); it doesn't know anything
about the HTML DTD. For all it knows, the above example could be perfectly
valid (the "b" element might allow both its start and end tags to be
Does the validation need to be done in Python? If not, you can use
"nsgmls" to validate any SGML document for which you have a DTD. OpenSP
includes nsgmls along with the various HTML DTDs.
More information about the Python-list