[Python-bugs-list] [ python-Bugs-620243 ] HTMLParser:endtag events in comments

Thu, 14 Nov 2002 21:21:44 -0800

Bugs item #620243, was opened at 2002-10-08 10:11
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=620243&group_id=5470

Category: Python Library
Group: Python 2.2.2
>Status: Pending
Resolution: None
Priority: 5
Submitted By: June Kim (juneaftn)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: HTMLParser:endtag events in comments

Initial Comment:
HTMLParser triggers endtag events when it meets closing tags
inside comments.

>>> from HTMLParser import HTMLParser
>>> class P(HTMLParser):
	def handle_endtag(self,tag):
		print "ENDTAG",tag

>>> p=P()
>>> p.feed("""\
<html>
<body>
<script>
<!--
document.write('<h1>testing</h1>');
-->
</script>
</body>
</html>""")

ENDTAG h1
ENDTAG script
ENDTAG body
ENDTAG html

see http://groups.google.com/groups?selm=evkjmuohcuosh0tqgn2li03kfo7qknatsp%404ax.com

----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-11-15 00:21

Message:

I'm not convinced this is a bug, but that's mainly due to 
details of the specification some people consider obscure, 
and to (ta da!)... the version of the HTML spec you look at!

Here's a quick synopsis; refer to the latest edition of the 
HTML 4 spec for more details.

There are two kinds of character data in HTML documents, 
PCDATA and CDATA.  Most is PCDATA, which means all 
markup constructs are allowed.  A few elements (SCRIPT 
and STYLE in particular) contain the more restrictive CDATA, 
in which only end tags are recognized.  Since SCRIPT contains 
CDATA (in the more recent versions of HTML), comments are 
not recognized -- the characters '<!--' are just plain text, which 
HTMLParser gets right.  The sequence '</h1>' is an end tag, so 
it's a legal token at that position in the input document.  It is 
*not* legal in the HTML syntax, though: SCRIPT must have 
an explicit end tag, and no H1 was open anyway.  A proper 
HTML parser (based on SGML) would raise an error.
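
To make the CDATA rule concrete, here is a minimal sketch of the only
scan that matters inside SCRIPT content: looking for "</" followed by a
letter.  The function name first_end_tag_open is made up for this
illustration and is not part of HTMLParser; the code is written for
Python 3 rather than the 2.2 series discussed in this report.

```python
import re

# Inside CDATA content the only delimiter that is recognized is the
# end-tag open sequence: "</" immediately followed by a letter.
# Everything else, including "<!--" and "<h1>", is plain text.
ETAGO = re.compile(r"</[A-Za-z]")


def first_end_tag_open(cdata_text):
    """Return the index of the first "</" + letter, or -1 if none.

    Hypothetical helper for illustration only.
    """
    match = ETAGO.search(cdata_text)
    return match.start() if match else -1


# The SCRIPT content from the original report.
script_body = "\n<!--\ndocument.write('<h1>testing</h1>');\n-->\n"
index = first_end_tag_open(script_body)
print(repr(script_body[index:index + 5]))  # prints '</h1>'
```

The scan skips '<!--' and '<h1>' and stops at '</h1>', which is why
that end tag is a token at that position even though the surrounding
text looks like a comment.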

Now, the application we wrote HTMLParser for originally did 
not want to perform all the same checks, and the information 
provided is sufficient to allow an application to extend the 
parser to provide the right checks, so we figured that was 
good enough -- our app could enforce the checks it did care 
about, otherwise not mess with the provided HTML (we 
wanted round-trippability and non-interference as much as 
possible).
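
As a rough sketch of the kind of check an application could layer on
top of the parser, the subclass below keeps a stack of open elements
and records end tags that match nothing open.  The names
EndTagChecker, open_tags and stray_end_tags are invented for this
example, and it uses the Python 3 module html.parser as a stand-in for
the Python 2 HTMLParser module in the report, so treat it as an
illustration rather than what our application actually did.

```python
from html.parser import HTMLParser


class EndTagChecker(HTMLParser):
    """Hypothetical checker: records end tags with no open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # stack of currently open elements
        self.stray_end_tags = []   # end tags that matched nothing

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            # No such element is open: note it and leave the stack alone.
            self.stray_end_tags.append(tag)
            return
        # Pop back to the matching element; elements left open above it
        # (omitted end tags, for example) are discarded along the way.
        while self.open_tags.pop() != tag:
            pass


checker = EndTagChecker()
checker.feed("<body><p>hi</h1></p></body>")
checker.close()
print(checker.stray_end_tags)  # prints ['h1']
```

The parser itself still reports every end tag it sees; deciding what
counts as an error stays in the application.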

So here's the real catch:  different versions of HTML deal with 
this differently.  What's "right" depends on the version of the 
specification the input document is expected to conform to.  
For the most part, applications shouldn't really need to care, 
but we're seeing here what happens when incredibly lenient 
implementations become the norm, as often happens when 
we talk about "Internet time."  ;-(

HTML 3.2 and newer define SCRIPT and STYLE as CDATA, 
but earlier versions did not define them at all, so browsers (as 
the ultimately permissive parsers) simply ignored them, and 
treated their content as PCDATA.  So the comments were 
parsed as such.  When the elements were added to the 
specification, the aim was to avoid having to escape every stray 
less-than or greater-than character in a script, so their content 
was made CDATA.

So the result is that we do the right thing... but only for HTML 
3.2 and newer.  The behavior you're expecting would be 
reasonable for HTML 2.0... unless we threw an exception 
because there was an undefined element in the document in 
some hypothetical "strict mode."

So it's not clear that anything needs to be changed.

----------------------------------------------------------------------
