[ python-Bugs-1051840 ] HTMLParser doesn't treat endtags in <script> tags as CDATA

Tue May 2 22:15:37 CEST 2006

Bugs item #1051840, was opened at 2004-10-21 19:02
Message generated for change (Settings changed) made by fdrake
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1051840&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: Luke Bradley (neptune235)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: HTMLParser doesn't treat endtags in <script> tags as CDATA

Initial Comment:
HTMLParser.HTMLParser in Python 2.3.4 calls
self.handle_endtag() for end tags within script and
style sections, which it should not, because the
content is supposed to be CDATA, as defined in
CDATA_CONTENT_ELEMENTS within HTMLParser. The following
script will demonstrate this problem:

import HTMLParser

class MyHandler(HTMLParser.HTMLParser):
    tags = []
    def handle_starttag(self, tag, attr):
        self.tags.append(tag)
    def handle_endtag(self, tag):
        if tag != self.tags[-1]:
            # this should never happen in a well-formed document
            raise ValueError("Not well-formed, endtag '" + tag +
                             "' doesn't match starttag '" + self.lasttag + "'")
        self.tags.pop(-1)

s = """
<html>
    <body>
    This page is completely well formed
        <script language="javascript">
            alert("</a></a>");
        </script>
        blah blah
    </body>
</html>
"""

m = MyHandler()
m.feed(s)

This will raise an exception. I fixed the bug by
changing the parse_endtag function on line 318 of
HTMLParser to the following:

def parse_endtag(self, i):
    rawdata = self.rawdata
    assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
    match = endendtag.search(rawdata, i+1) # >
    if not match:
        return -1
    j = match.end()
    match = endtagfind.match(rawdata, i) # </ + tag + >
    if not match:
        self.error("bad end tag: %s" % repr(rawdata[i:j]))
    tag = match.group(1)
    #START BUGFIX
    if self.interesting == interesting_cdata:
        # we're in one of the CDATA_CONTENT_ELEMENTS
        if tag == self.lasttag and tag in self.CDATA_CONTENT_ELEMENTS:
            # it's the end of the CDATA_CONTENT_ELEMENTS tag we are in.
            self.handle_endtag(tag.lower())
            self.clear_cdata_mode()  # back to normal mode
        else:
            # we're still inside the CDATA_CONTENT_ELEMENTS tag.
            # throw the tag to handle_data instead.
            self.handle_data(match.group())
    else:
        # we're not in a CDATA_CONTENT_ELEMENTS tag. standard ending:
        self.handle_endtag(tag.lower())
    return j
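
For comparison, here is a sketch of the same reproduction ported to the
html.parser module of Python 3, where the HTMLParser class now lives. The
class name TagStackChecker and its data list are illustrative and not part
of the report. As far as I know, current Python 3 releases hand a
non-matching end tag inside <script> to handle_data() rather than
handle_endtag(), so this port is expected to finish without raising; the
Python 2.3.4 module discussed above behaves differently.

```python
from html.parser import HTMLParser


class TagStackChecker(HTMLParser):
    """Hypothetical port of MyHandler: checks that end tags match start tags."""

    def __init__(self):
        super().__init__()
        self.tags = []   # per-instance stack of currently open tags
        self.data = []   # text chunks received through handle_data

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        if not self.tags or tag != self.tags[-1]:
            raise ValueError(
                "end tag %r does not match open tags %r" % (tag, self.tags))
        self.tags.pop()

    def handle_data(self, data):
        self.data.append(data)


PAGE = """
<html>
    <body>
    This page is completely well formed
        <script language="javascript">
            alert("</a></a>");
        </script>
        blah blah
    </body>
</html>
"""

checker = TagStackChecker()
checker.feed(PAGE)
checker.close()
```

Keeping the stack on the instance rather than on the class avoids sharing
state between parsers, which the class-level "tags = []" above would do.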

----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2006-05-02 16:15

Message:
Logged In: YES 
user_id=3066

This is a common complaint (because no-one reads the specs),
but since people have lived with it this long, I'm inclined
to leave it alone.  If people want to read every two-bit
piece of broken HTML, they can use BeautifulSoup, which
handles that task quite nicely.

Rejecting as "don't go there."

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-10-23 20:46

Message:
Logged In: YES 
user_id=80475

Fred, what do you think?

----------------------------------------------------------------------

Comment By: Luke Bradley (neptune235)
Date: 2004-10-22 19:52

Message:
Logged In: YES 
user_id=178561

<i>Although a fix may be worthwhile, as this happens a lot in 
practice, HTMLParser is following the letter of the law in 
throwing exceptions on pages that aren't strictly valid. 

http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data</i>

Well you're right, I'll be damned! 
Hmm. I want to use HTMLParser to access other people's pages
on the net, and I can't fix their bad HTML. The problem here
is that I'm not sure how to handle this at the level of my
handler without inadvertently changing their JavaScript by
doing something like:
handle_data("</" + tag + ">")
in the handle_endtag event, which lowercases the tag. Is
there a way to access the literal string of the end tag in my
handler, I wonder? If not, it might be useful to add some
functionality to HTMLParser that allows us to handle invalid
HTML at the level of our handler without sacrificing
HTMLParser's commitment to standards compliance.
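
One possible way to get at the literal end tag text, sketched here for the
Python 3 html.parser module: record the raw slice in an overridden
parse_endtag() before delegating to the base class. This is only a sketch
under stated assumptions. parse_endtag() and the rawdata attribute are
undocumented internals that may change between releases, and the names
RawEndTagParser, raw_endtag and seen are made up for illustration.

```python
from html.parser import HTMLParser


class RawEndTagParser(HTMLParser):
    """Hypothetical subclass that remembers the literal text of end tags."""

    def __init__(self):
        super().__init__()
        self.raw_endtag = None   # literal text of the end tag being parsed
        self.seen = []           # (lowercased tag, literal text) pairs

    def parse_endtag(self, i):
        # Assumption: self.rawdata holds the unparsed buffer and i points
        # at "</". Both are internal details of HTMLParser.
        end = self.rawdata.find(">", i)
        self.raw_endtag = self.rawdata[i:end + 1] if end != -1 else None
        return super().parse_endtag(i)

    def handle_endtag(self, tag):
        # Note: for self-closing start tags such as <br/>, handle_endtag is
        # reached without parse_endtag, so raw_endtag may be stale there.
        self.seen.append((tag, self.raw_endtag))


parser = RawEndTagParser()
parser.feed("<B>bold</B>")
parser.close()
```

With this in place a handler could pass raw_endtag to handle_data() and
keep the original case, instead of rebuilding the tag from its lowercased
name.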

----------------------------------------------------------------------

Comment By: Richard Brodie (leogah)
Date: 2004-10-22 14:02

Message:
Logged In: YES 
user_id=356893

Although a fix may be worthwhile, as this happens a lot in 
practice, HTMLParser is following the letter of the law in 
throwing exceptions on pages that aren't strictly valid. 

http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data
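
The rule in that note can be sketched in a few lines: under HTML 4, script
and style content ends at the first "</" followed by a letter, whatever tag
name comes after it. The helper name html4_cdata_end is made up for
illustration and is not part of HTMLParser.

```python
import re

# HTML 4.01 rule: CDATA content ends at the first "</" followed by a letter.
_ETAGO = re.compile(r"</[A-Za-z]")


def html4_cdata_end(content):
    """Return the index where HTML 4 ends the CDATA content, or None."""
    match = _ETAGO.search(content)
    return match.start() if match else None


unsafe = 'alert("</a></a>");'
safe = 'alert("<\\/a><\\/a>");'   # JavaScript-escaped form suggested by the note

unsafe_end = html4_cdata_end(unsafe)   # points at the first "</a>"
safe_end = html4_cdata_end(safe)       # no "</" + letter sequence present
```

This is why the page in the report is not strictly valid: the script
content ends at the first "</a>", not at "</script>". Writing the string as
"<\/a>" in the JavaScript source avoids the problem.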

----------------------------------------------------------------------

Comment By: Luke Bradley (neptune235)
Date: 2004-10-21 19:04

Message:
Logged In: YES 
user_id=178561

oops, I didn't know this would remove indentation. Let me
attach a file.

----------------------------------------------------------------------
