intolerant HTML parser
John Nagle
nagle at animats.com
Sat Feb 6 14:43:19 EST 2010
Jim wrote:
> I generate some HTML and I want to include in my unit tests a check
> for syntax. So I am looking for a program that will complain at any
> syntax irregularities.
>
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax. I just tried feeding
> HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
> didn't complain.
Try HTML5lib.
http://code.google.com/p/html5lib/downloads/list
The syntax for HTML5 has well-defined notions of "correct",
"fixable", and "unparseable". For example, the common but
incorrect form of HTML comments,
<- comment ->
is understood.
HTML5lib is slow, though. Sometimes very slow. It's really a reference
implementation of the spec. There's code like this:
#Should speed up this check somehow (e.g. move the set to a constant)
if ((0x0001 <= charAsInt <= 0x0008) or
(0x000E <= charAsInt <= 0x001F) or
(0x007F <= charAsInt <= 0x009F) or
(0xFDD0 <= charAsInt <= 0xFDEF) or
charAsInt in frozenset([0x000B, 0xFFFE, 0xFFFF, 0x1FFFE,
0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE,
0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE,
0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE,
0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE,
0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE,
0xFFFFF, 0x10FFFE, 0x10FFFF])):
self.tokenQueue.append({"type": tokenTypes["ParseError"],
"data":
"illegal-codepoint-for-numeric-entity",
"datavars": {"charAsInt": charAsInt}})
Every time through the loop (once per character), they build that frozen
set again.
John Nagle
More information about the Python-list
mailing list