HTML parsing bug?
Fredrik Lundh
fredrik at pythonware.com
Wed Feb 1 03:22:11 EST 2006
g_no_mail_please at yahoo.com wrote:
> Python 2.3.5 seems to choke when trying to parse html files, because it
> doesn't realize that what's inside <!-- --> is a comment in HTML,
> even if this comment is inside <script> </script>, especially if it's a
> comment inside that script code too.
nope. what's inside <!-- --> is not a comment if it's inside a <script>
or <style> tag. read the spec:
http://www.w3.org/TR/REC-html40/types.html#type-cdata
"Although the STYLE and SCRIPT elements use CDATA for their data
model, for these elements, CDATA must be handled differently by
user agents. Markup and entities must be treated as raw text and
passed to the application as is. The first occurrence of the
character sequence "</" (end-tag open delimiter) is treated as
terminating the end of the element's content. In valid documents,
this would be the end tag for the element."
in your case, the first occurrence of "</" is not the end tag.
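to make the rule concrete, here is a minimal sketch of what the quoted paragraph describes. the helper name html4_cdata_end is hypothetical (it is not part of any library); it only looks for the first "</" in a string, which is where an HTML 4 user agent considers script or style content to end. the sample script text is made up for illustration.

```python
def html4_cdata_end(content):
    """Return the index where HTML 4 CDATA content ends: the first "</".

    Hypothetical helper for illustration only; returns -1 if no "</" occurs.
    """
    return content.find("</")


# Made-up script text that writes markup from inside a string literal.
body = 'document.write("<b>hi</b>");</script>'
end = html4_cdata_end(body)

# The content is cut at the "</" of "</b>", not at "</script>".
script_content = body[:end]
print(script_content)
```

under this rule the script content stops in the middle of the string literal, which is why a "</" inside script code trips up a strict HTML 4 parser.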
you can disable proper parsing by setting the CDATA_CONTENT_ELEMENTS
attribute on the parser instance, before you start parsing. by default, it is
set to
    CDATA_CONTENT_ELEMENTS = ("script", "style")
setting it to an empty tuple disables HTML-compliant handling for these
elements:
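    from HTMLParser import HTMLParser  # Python 2 module name

    p = HTMLParser()
    p.CDATA_CONTENT_ELEMENTS = ()  # no element gets CDATA treatment
    p.feed(f.read())  # f: an open file object for your HTML file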
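to see the effect of the attribute, the following sketch parses the same snippet twice and records what the parser reports. it is written for Python 3, where the class lives in html.parser (an assumption on my part, since the post is about Python 2.3). the Recorder subclass, the parse helper and the sample markup are hypothetical names made up for this example, and details may differ between Python versions.

```python
from html.parser import HTMLParser  # Python 3 location of the module


class Recorder(HTMLParser):
    """Hypothetical subclass that records comments and text data."""

    def __init__(self, cdata_elements=None):
        super().__init__()
        if cdata_elements is not None:
            # Instance attribute shadows the class-level default.
            self.CDATA_CONTENT_ELEMENTS = cdata_elements
        self.comments = []
        self.data = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        self.data.append(data)


def parse(html, cdata_elements=None):
    """Feed html to a fresh Recorder and return (comments, joined text)."""
    parser = Recorder(cdata_elements)
    parser.feed(html)
    parser.close()
    return parser.comments, "".join(parser.data)


html = '<script><!-- alert("hi") --></script>'
default_comments, default_text = parse(html)
relaxed_comments, relaxed_text = parse(html, cdata_elements=())
```

with the default setting, the text between the script tags should arrive through handle_data as raw text; with the empty tuple, it should arrive through handle_comment instead.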
</F>