[New-bugs-announce] [issue6662] HTMLParser.HTMLParser doesn't handle malformed charrefs

Dave Day report at bugs.python.org
Fri Aug 7 03:25:10 CEST 2009

New submission from Dave Day <dayveday at gmail.com>:

When HTMLParser.HTMLParser encounters a malformed charref (for example 
&#bad;) it no longer parses the HTML that follows it correctly.

For example, given the input
<p>&#bad;</p>
the parser recognises the starttag "p" but treats everything after it as data.

To reproduce:
import HTMLParser

class MyParser(HTMLParser.HTMLParser):
    # Print every event the parser reports.
    def handle_starttag(self, tag, attrs):
        print 'Start "%s"' % tag
    def handle_endtag(self, tag):
        print 'End "%s"' % tag
    def handle_charref(self, ref):
        print 'Charref "%s"' % ref
    def handle_data(self, data):
        print 'Data "%s"' % data

parser = MyParser()
# Input inferred from the expected output below.
parser.feed('<p>&#bad;</p>')

Expected output:
Start "p"
Data "&#bad;"
End "p"

Actual output:
Start "p"
Data "&#bad;</p>"
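
For anyone re-checking this on a later interpreter, here is a minimal sketch of the same test written against Python 3's html.parser module. That is an assumption on my part, since the report itself only covers Python 2.4 and 2.5. EventCollector and collect_events are hypothetical names, and the events are gathered in a list rather than printed so they can be compared with the expected output above.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record parser events as (kind, value) tuples (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_charref(self, ref):
        # Not invoked while convert_charrefs is True (the Python 3 default).
        self.events.append(('charref', ref))

    def handle_data(self, data):
        self.events.append(('data', data))


def collect_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered
    return parser.events


# Same input as the report: a paragraph holding a malformed charref.
print(collect_events('<p>&#bad;</p>'))
```

If the final event is an end tag for "p", the parser recovered after the malformed charref; if "</p>" turns up inside a data event instead, the behaviour described in this report is still present.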

components: Library (Lib)
messages: 91392
nosy: dayveday
severity: normal
status: open
title: HTMLParser.HTMLParser doesn't handle malformed charrefs
type: behavior
versions: Python 2.4, Python 2.5

Python tracker <report at bugs.python.org>