Found a parsing bug in HTMLParser

Bengt Richter bokr at
Sun Feb 9 22:38:36 CET 2003

On Sun, 9 Feb 2003 18:06:56 +0100, Grzegorz Adam Hankiewicz <gradha at> wrote:

>I've found a bug in HTMLParser parsing some of my webpages. The
>problem is using an attribute with a value inside double quotes
>which is near another attribute. I've created a small testcase

Too "near" to be legal HTML 4.0, I believe. From the spec:

    3.2.2 Attributes

    Elements may have associated properties, called attributes, which may have values
    (by default, or set by authors or scripts). Attribute/value pairs appear before
    the final ">" of an element's start tag. Any number of (legal) attribute value pairs,
    separated by spaces, may appear in an element's start tag. They may appear in any order.

Your DOCTYPE declares HTML 4.0, but even if the page is aiming for the newer XHTML,
XML requires white space before each attribute. From my copy of the XML spec:

    STag ::= '<' Name (S Attribute)* S? '>'
    S    ::=  (#x20 | #x9 | #xD | #xA)+

so it surprises me that you get an ok validation, though I'm not surprised
that browsers ignore anomalies.
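
To make the grammar point concrete, here is a rough sketch that turns the STag production into a regular expression and checks both spellings of the tag. The names (is_xml_start_tag and the pattern constants) are my own, and it simplifies the spec: names are ASCII only and attribute values are not checked for entity references. It is written for a current Python 3, not the 2.2.1 shown in the session.

```python
import re

# Rough translation of the XML 1.0 start-tag productions.
# Simplifying assumptions: ASCII-only names, no entity checks in values.
S = r"[ \t\r\n]+"                                  # S ::= (#x20 | #x9 | #xD | #xA)+
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"
ATT_VALUE = r"""(?:"[^<"]*"|'[^<']*')"""
ATTRIBUTE = rf"{NAME}(?:{S})?=(?:{S})?{ATT_VALUE}"   # Name Eq AttValue
STAG = re.compile(rf"<{NAME}(?:{S}{ATTRIBUTE})*(?:{S})?>")


def is_xml_start_tag(text):
    """Return True if text is a start tag with white space before every attribute."""
    return STAG.fullmatch(text) is not None


print(is_xml_start_tag('<a href="http://ss"title="pe">'))   # False
print(is_xml_start_tag('<a href="http://ss" title="pe">'))  # True
```

The first tag fails because each repetition of (S Attribute) has to start with white space, and there is none between the closing quote of href and title.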

>which you can see below. The w3c validator says the page is ok
>and browsers render it without problems.  Does it happen with newer
>Python versions? What's the procedure for bug reports?
>PD: Don't CC me your replies.
>$ cat test.html
><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">
><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
><a href="http://ss"title="pe">P</a>
                    ^^^^^^^^^^ -- need white space in front of this, e.g.,
 <a href="http://ss" title="pe">P</a>
>$ python
>Python 2.2.1 (#1, Apr 21 2002, 08:38:44)
>[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
>Type "help", "copyright", "credits" or "license" for more information.
>>>> from HTMLParser import HTMLParser
>>>> p = HTMLParser()
>>>> file = open("test.html", "rt")
>>>> p.feed("".join(file.readlines()))
>>>> file.close()
>>>> p.close()
>Traceback (most recent call last):
>  File "<stdin>", line 1, in ?
>  File "/usr/lib/python2.2/", line 112, in close
>    self.goahead(1)
>  File "/usr/lib/python2.2/", line 166, in goahead
>    self.error("EOF in middle of construct")
>  File "/usr/lib/python2.2/", line 115, in error
>    raise HTMLParseError(message, self.getpos())
>HTMLParser.HTMLParseError: EOF in middle of construct, at line 5, column 1
Seems like a better message could have been generated, though.
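
For anyone who wants to see what the parser makes of each spelling, a minimal sketch follows. It assumes Python 3, where the module is html.parser rather than HTMLParser; AttrCollector and collect are hypothetical helper names, not part of the library. How the tag without the space is reported depends on the Python version, so I make no claim about that output here.

```python
from html.parser import HTMLParser  # Python 3 name of the old HTMLParser module


class AttrCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.seen.append((tag, attrs))


def collect(html):
    """Feed html to a fresh parser and return the start tags it reported."""
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.seen


# Well-formed tag, with white space between the attributes
print(collect('<a href="http://ss" title="pe">P</a>'))

# Tag from test.html; the result is version dependent
print(collect('<a href="http://ss"title="pe">P</a>'))
```
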
Bengt Richter
