intolerant HTML parser

Sat Feb 6 14:43:19 EST 2010

Jim wrote:
> I generate some HTML and I want to include in my unit tests a check
> for syntax.  So I am looking for a program that will complain at any
> syntax irregularities.
> 
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax.  I just tried feeding
> HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
> didn't complain.
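
HTMLParser is only a tokenizer: it reports start and end tags as it sees
them and never checks that they nest, which is why that input got through.
A rough sketch of a check layered on top of it, assuming Python 3's
html.parser module (the names NestingError, StrictNestingChecker and
check_nesting are made up for this example, and the rule is deliberately
stricter than real HTML, which lets some end tags be omitted):

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are not tracked on the stack.
VOID_ELEMENTS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
])


class NestingError(Exception):
    """Raised when tags are closed out of order or left open (hypothetical)."""


class StrictNestingChecker(HTMLParser):
    """Tracks open tags on a stack and complains about any mismatch."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # A self-closing tag such as <br/> opens and closes itself.
        pass

    def handle_endtag(self, tag):
        if not self.open_tags or self.open_tags[-1] != tag:
            raise NestingError(
                "unexpected </%s>; open tags: %r" % (tag, self.open_tags))
        self.open_tags.pop()


def check_nesting(markup):
    """Raise NestingError if the tags in markup are not properly nested."""
    checker = StrictNestingChecker()
    checker.feed(markup)
    checker.close()
    if checker.open_tags:
        raise NestingError("unclosed tags: %r" % checker.open_tags)
```

In a unit test, check_nesting('<p>a<b>b</p></b>') raises NestingError at
the </p>, while check_nesting('<p>a<b>b</b></p>') returns quietly.  This
only catches nesting mistakes; it is not a full syntax check.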

    Try HTML5lib.

	http://code.google.com/p/html5lib/downloads/list

The syntax for HTML5 has well-defined notions of "correct",
"fixable", and "unparseable".  For example, the common but
incorrect form of HTML comments,

	<- comment ->

is understood.

HTML5lib is slow, though.  Sometimes very slow.  It's really a reference
implementation of the spec, not something tuned for speed.  There's code
like this:

             # Should speed up this check somehow (e.g. move the set to a constant)
             if ((0x0001 <= charAsInt <= 0x0008) or
                 (0x000E <= charAsInt <= 0x001F) or
                 (0x007F  <= charAsInt <= 0x009F) or
                 (0xFDD0  <= charAsInt <= 0xFDEF) or
                 charAsInt in frozenset([0x000B, 0xFFFE, 0xFFFF, 0x1FFFE,
                                         0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE,
                                         0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
                                         0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE,
                                         0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE,
                                         0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE,
                                         0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                                         0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE,
                                         0xFFFFF, 0x10FFFE, 0x10FFFF])):
                 self.tokenQueue.append({"type": tokenTypes["ParseError"],
                                         "data":
                                          "illegal-codepoint-for-numeric-entity",
                                         "datavars": {"charAsInt": charAsInt}})

Every time through the loop (once per character), they build that
frozenset again from the list literal, even though its contents never
change.
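
A minimal sketch of what the comment in that excerpt suggests: build the
set once at import time and have the check refer to it.  The names
ILLEGAL_SINGLE_CODEPOINTS and is_illegal_codepoint are invented here and
are not html5lib's; the values are the ones in the excerpt, generated
rather than typed out (0x000B plus the last two code points of each of
the 17 Unicode planes).

```python
# Built once, when the module is imported, instead of on every check.
ILLEGAL_SINGLE_CODEPOINTS = frozenset(
    [0x000B] +
    [(plane << 16) | low
     for plane in range(0x11)          # planes 0 through 0x10
     for low in (0xFFFE, 0xFFFF)]      # last two code points of each plane
)


def is_illegal_codepoint(char_as_int):
    """Same test as the excerpt, but with the set hoisted to a constant."""
    return ((0x0001 <= char_as_int <= 0x0008) or
            (0x000E <= char_as_int <= 0x001F) or
            (0x007F <= char_as_int <= 0x009F) or
            (0xFDD0 <= char_as_int <= 0xFDEF) or
            char_as_int in ILLEGAL_SINGLE_CODEPOINTS)
```

The membership test is still a hash lookup; only the repeated
construction of the set goes away.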

				John Nagle