[ python-Bugs-1051840 ] HTMLParser doesn't treat endtags in <script> tags as CDATA

SourceForge.net noreply at sourceforge.net
Fri Oct 22 01:04:52 CEST 2004


Bugs item #1051840, was opened at 2004-10-21 16:02
Message generated for change (Comment added) made by neptune235
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1051840&group_id=5470

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Luke Bradley (neptune235)
Assigned to: Nobody/Anonymous (nobody)
Summary: HTMLParser doesn't treat endtags in <script> tags as CDATA

Initial Comment:
HTMLParser.HTMLParser in Python 2.3.4 calls
self.handle_endtag() for end tags that appear inside
script and style elements. It should not, because the
content of those elements is supposed to be CDATA, as
listed in CDATA_CONTENT_ELEMENTS within HTMLParser. The
following script demonstrates the problem:

import HTMLParser

class MyHandler(HTMLParser.HTMLParser):
    tags = []
    def handle_starttag(self, tag, attr):
        self.tags.append(tag)
    def handle_endtag(self, tag):
        if tag != self.tags[-1]:
            # this should never happen in a well-formed document
            raise ValueError("Not well-formed, endtag '" + tag +
                "' doesn't match starttag '" + self.tags[-1] + "'")
        self.tags.pop(-1)

s = """
<html>
    <body>
    This page is completely well formed
        <script language="javascript">
            alert("</a></a>");
        </script>
        blah blah
    </body>
</html>
"""

m = MyHandler()
m.feed(s)

This will raise an exception. I fixed the bug by
changing the parse_endtag function on line 318 of
HTMLParser to the following:

def parse_endtag(self, i):
    rawdata = self.rawdata
    assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
    match = endendtag.search(rawdata, i+1) # >
    if not match:
        return -1
    j = match.end()
    match = endtagfind.match(rawdata, i) # </ + tag + >
    if not match:
        self.error("bad end tag: %r" % (rawdata[i:j],))
    # lasttag and CDATA_CONTENT_ELEMENTS are lowercase, so lowercase
    # the tag once here to make the comparisons case-insensitive
    tag = match.group(1).lower()
    #START BUGFIX
    if self.interesting == interesting_cdata:
        # we're in one of the CDATA_CONTENT_ELEMENTS
        if tag == self.lasttag and tag in self.CDATA_CONTENT_ELEMENTS:
            # it's the end of the CDATA_CONTENT_ELEMENTS tag we are in
            self.handle_endtag(tag)
            self.clear_cdata_mode() # back to normal mode
        else:
            # we're still inside the CDATA_CONTENT_ELEMENTS tag;
            # pass the tag text to handle_data instead
            self.handle_data(match.group())
    else:
        # we're not in a CDATA_CONTENT_ELEMENTS tag: standard ending
        self.handle_endtag(tag)
    return j
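
For anyone who wants to check whether a given interpreter already
behaves the way this patch intends, here is a minimal sketch. It
assumes Python 3, where the module is named html.parser rather than
HTMLParser, so it is not the 2.3.4 code discussed above. The class
name TagStackChecker and its attributes are made up for illustration.
It feeds the same document and records which end tags and text chunks
the parser reports.

```python
from html.parser import HTMLParser  # Python 3 module name (assumption)


class TagStackChecker(HTMLParser):
    """Hypothetical helper: records parser events and checks tag nesting."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of start tags not yet closed
        self.end_tags = []   # every end tag the parser reported
        self.data = []       # every text chunk the parser reported

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)
        # An end tag must match the most recently opened start tag.
        if not self.open_tags or tag != self.open_tags[-1]:
            raise ValueError(
                "end tag %r does not match open tags %r" % (tag, self.open_tags)
            )
        self.open_tags.pop()

    def handle_data(self, data):
        self.data.append(data)


DOC = """
<html>
    <body>
    This page is completely well formed
        <script language="javascript">
            alert("</a></a>");
        </script>
        blah blah
    </body>
</html>
"""

checker = TagStackChecker()
checker.feed(DOC)
checker.close()

# If script content is treated as CDATA, no ValueError is raised, the
# "</a></a>" text shows up as data, and only real end tags are reported.
print(checker.end_tags)
print("</a></a>" in "".join(checker.data))
```

If the parser reported the two </a> tags through handle_endtag, the
ValueError above would fire, which is the same symptom as in the
original script.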


----------------------------------------------------------------------

>Comment By: Luke Bradley (neptune235)
Date: 2004-10-21 16:04

Message:
Logged In: YES 
user_id=178561

oops, I didn't know this would remove indentation. Let me
attach a file.
