Help w/ HTMLParser lib

Carl Banks imbosol at aerojockey.invalid
Fri May 21 01:17:00 EDT 2004


Kevin T. Ryan wrote:
> Hi all - 
> 
> I'm somewhat new to python (about 1 year), and I'm trying to write a program
> that opens a file like object w/ urllib.urlopen, and then parse the data by
> passing it to a class that subclasses HTMLParser.HTMLParser.  On the web
> page, however, there is javascript - and I think that is causing an error
> in parsing the data.  Here's the error:
> 
> Traceback (most recent call last):
>  File "<stdin>", line 1, in ?
>  File "html_helper.py", line 30, in parse_data
>    p.feed(data)
>  File "//usr/lib/python2.2/HTMLParser.py", line 108, in feed
>    self.goahead(0)
>  File "//usr/lib/python2.2/HTMLParser.py", line 150, in goahead
>    k = self.parse_endtag(i)
>  File "//usr/lib/python2.2/HTMLParser.py", line 329, in parse_endtag
>    self.error("bad end tag: %s" % `rawdata[i:j]`)
>  File "//usr/lib/python2.2/HTMLParser.py", line 115, in error
>    raise HTMLParseError(message, self.getpos())
> HTMLParser.HTMLParseError: bad end tag: "</scr' + 'ipt>", at line 411,
> column 7
> 
> I've tried to use a try/except clause both w/in my class and w/in a function
> that wraps the class for easy access, but to no avail.  The code works on
> other websites, so I know that it's not *completely* off.  Any help would
> be greatly appreciated!  TIA :)


You might be out of luck as far as HTMLParser goes.  HTMLParser sees
the "</scr' + 'ipt>" inside the JavaScript string as a closing tag (an
illegal one), and there's no way to turn off its closing-tag parsing.

I suggest you work around it by removing the whole script block before
feeding the file to HTMLParser.  If you feed the file one line at a
time, search each line for an opening script tag.  If it's there, feed
only the part of the line before it to HTMLParser, then scan for the
closing tag yourself, and when you find it, feed only the part after
it to HTMLParser, doing nothing with the stuff in between.  Here is a
HIGHLY UNTESTED example:

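    import re

    _scriptopen = re.compile(r"<\s*script[^<>]*>", re.IGNORECASE)
    _scriptclose = re.compile(r"</\s*script\s*>", re.IGNORECASE)

    # Do this for each line read from urllibobject:
    m = _scriptopen.search(line)
    if m:
        # Feed only the text before the opening script tag.
        parserobject.feed(line[:m.start()])
        line = line[m.end():]
        while 1:
            # Skip lines until the real closing tag turns up.
            m2 = _scriptclose.search(line)
            if m2:
                # Feed only the text after the closing tag.
                parserobject.feed(line[m2.end():])
                break
            line = urllibobject.readline()
            if not line:
                break   # hit end of file inside the script
    else:
        parserobject.feed(line)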
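
For something that can be run on its own, here is a minimal sketch of
the same line-at-a-time idea, written as a generator for current
Python 3 rather than 2.2.  The names strip_script_lines, SCRIPT_OPEN
and SCRIPT_CLOSE are made up for illustration, and the patterns are an
assumption about what the page's script tags look like, not a general
HTML tokenizer.

```python
import re

# Assumed patterns: case-insensitive, tolerant of attributes and whitespace.
SCRIPT_OPEN = re.compile(r"<\s*script\b[^<>]*>", re.IGNORECASE)
SCRIPT_CLOSE = re.compile(r"<\s*/\s*script\s*>", re.IGNORECASE)


def strip_script_lines(lines):
    """Yield each line's text with <script>...</script> spans removed.

    State is carried across lines, so a script block may span many lines.
    """
    in_script = False
    for line in lines:
        pos = 0
        kept = []
        while pos < len(line):
            if in_script:
                m = SCRIPT_CLOSE.search(line, pos)
                if m is None:
                    break  # rest of this line is still inside the script
                pos = m.end()
                in_script = False
            else:
                m = SCRIPT_OPEN.search(line, pos)
                if m is None:
                    kept.append(line[pos:])  # no script tag: keep the rest
                    break
                kept.append(line[pos:m.start()])  # keep text before the tag
                pos = m.end()
                in_script = True
        yield "".join(kept)


sample = [
    "<p>a</p><script type='x'>\n",
    "document.write('</scr' + 'ipt>');\n",
    "</script><p>b</p>\n",
]
cleaned = "".join(strip_script_lines(sample))
```

Each yielded chunk can be passed straight to the parser's feed()
method.  While inside a script, only a well-formed closing tag ends the
skip, so the split-up "</scr' + 'ipt>" string never reaches the parser.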


This workaround isn't proper HTML handling, but (once it's debugged)
it'll work most of the time as a practical matter.  If you feed the
whole file at once, then you could maybe do it with one regexp (again
HIGHLY UNTESTED):

    _scripttag = re.compile(r"<\s*script[^<>]*>.*?</\s*script\s*>",
                            re.DOTALL | re.IGNORECASE)
    buffer = _scripttag.sub('', buffer)
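
A small self-contained sketch of the whole-buffer approach, for
current Python 3.  The function name strip_scripts and the sample
string are hypothetical; the pattern assumes script tags carry no "<"
or ">" inside their attributes.

```python
import re

# Non-greedy match from an opening script tag to the first real closing tag.
# DOTALL lets "." cross newlines; IGNORECASE covers <SCRIPT> as well.
SCRIPT_BLOCK = re.compile(
    r"<\s*script\b[^<>]*>.*?<\s*/\s*script\s*>",
    re.DOTALL | re.IGNORECASE,
)


def strip_scripts(html):
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_BLOCK.sub("", html)


page = "<p>a</p><SCRIPT>x = '</scr' + 'ipt>';\n</SCRIPT><p>b</p>"
cleaned_page = strip_scripts(page)
```

The result can then be handed to the parser in a single feed() call.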


-- 
CARL BANKS                      http://www.aerojockey.com/software
"If you believe in yourself, drink your school, stay on drugs, and
don't do milk, you can get work." 
          -- Parody of Mr. T from a Robert Smigel Cartoon


