HTML parsing bug?

g_no_mail_please at yahoo.com g_no_mail_please at yahoo.com
Mon Jan 30 09:45:28 EST 2006


Python 2.3.5 seems to choke when parsing HTML files: it doesn't recognize
that whatever is inside <!--      --> is an HTML comment, even when that
comment sits inside <script> </script>, and especially when the text is
also commented out in the script code itself.

The HTML file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Choke on this</title>
<script language="JavaScript">
<!--
// </ht ml>  - this is a comment in JavaScript, which is itself inside an HTML comment
-->
</script>
</head>
<body>
    Hey there
</body>
</html>


The Python program:

from urllib2 import urlopen
from HTMLParser import HTMLParser
f = urlopen("file:///PATH_TO_THE_ABOVE/index.html")
p = HTMLParser()
p.feed(f.read())
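
For anyone who wants to try this without saving a file first, here is a rough self-contained sketch. It is not the program above: it assumes a current Python 3 interpreter (where the module is html.parser rather than HTMLParser), it feeds the same document from a string, and the names DOC, EventRecorder and record_events are made up for illustration. It only records which events the parser reports, so you can check whether the text inside <script> comes through as plain data or trips the parser up.

```python
from html.parser import HTMLParser

# The document from the post, with the wrapped JavaScript comment on one line.
DOC = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Choke on this</title>
<script language="JavaScript">
<!--
// </ht ml>  - this is a comment in JavaScript, which is itself inside an HTML comment
-->
</script>
</head>
<body>
    Hey there
</body>
</html>
"""


class EventRecorder(HTMLParser):
    """Hypothetical helper: remembers the events the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []
        self.comments = []
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        self.end_tags.append(tag)
        if tag == "script":
            self._in_script = False

    def handle_comment(self, data):
        # Only called for <!-- --> that the parser treats as an HTML comment.
        self.comments.append(data)

    def handle_data(self, data):
        # Keep only the text reported while inside <script> ... </script>.
        if self._in_script:
            self.script_chunks.append(data)


def record_events(document):
    """Feed the whole document at once and return the recorder."""
    parser = EventRecorder()
    parser.feed(document)
    parser.close()
    return parser


if __name__ == "__main__":
    result = record_events(DOC)
    print("start tags:", result.start_tags)
    print("end tags:  ", result.end_tags)
    print("comments:  ", result.comments)
    print("script text:", repr("".join(result.script_chunks)))
```

If the script body shows up under "script text" and the end tags list contains html only once, the parser treated the whole <!-- ... --> block as script data instead of reading the </ht ml> inside it as markup. Whether Python 2.3.5 behaves the same way is exactly the question of this post, so treat the sketch as a way to compare versions, not as a fix.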



