HTML parsing bug?

g_no_mail_please at g_no_mail_please at
Mon Jan 30 09:45:28 EST 2006

Python 2.3.5 seems to choke when trying to parse HTML files, because it
doesn't realize that what's inside <!--      --> is a comment in HTML,
even when that comment is inside <script> </script>, and especially when
the text is also a comment in the script code itself.

The html file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Choke on this</title>
<script language="JavaScript">
// </ht ml>  - this is a comment in JavaScript, which is itself inside an HTML comment
    Hey there

The Python program:

from urllib2 import urlopen
from HTMLParser import HTMLParser
f = urlopen("file:///PATH_TO_THE_ABOVE/index.html")
p = HTMLParser()
# Feed the file to the parser; nothing is parsed until feed() is called
p.feed(f.read())
p.close()
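
For anyone trying to narrow this down, here is a minimal sketch that records what the parser reports while it is inside the <script> element. The names EventRecorder and SAMPLE are made up for illustration, and SAMPLE is only a hypothetical stand-in for the file above, not the original. The import falls back to the Python 2 module name so the same script can be tried on old and new interpreters.

```python
try:
    from html.parser import HTMLParser   # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 module name

# Hypothetical stand-in for the problem file: a JavaScript comment that
# contains a closing tag, wrapped in an HTML comment inside <script>.
SAMPLE = """<html><head><title>Choke on this</title>
<script language="JavaScript">
<!--
// </html> - a JavaScript comment inside an HTML comment
-->
</script>
</head><body>Hey there</body></html>
"""

class EventRecorder(HTMLParser):
    """Collect end tags, comments, and the text seen inside <script>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_script = False
        self.end_tags = []
        self.comments = []
        self.script_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        self.end_tags.append(tag)

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        # Only keep text that the parser reports while inside <script>
        if self.in_script:
            self.script_text.append(data)

recorder = EventRecorder()
recorder.feed(SAMPLE)
recorder.close()

print("end tags:    %r" % recorder.end_tags)
print("comments:    %r" % recorder.comments)
print("script text: %r" % "".join(recorder.script_text))
```

If the parser treats script content as raw text, the closing tag from the JavaScript comment should show up in the script text and not in the list of end tags. If it shows up as an end tag, or the parser raises an error, the script body is being parsed as markup.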