[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

Yotam Medini report at bugs.python.org
Thu Sep 30 23:50:06 CEST 2010

Yotam Medini <yotam at users.sourceforge.net> added the comment:

HTMLParser.py fails when parsing the contents of
  <script> ... </script>
It can be fooled by JavaScript containing less-than '<' conditional expressions.
In the attached example:

 $ tar tvzf lt-in-script-example.tgz | cut -c24-
     796 2010-09-30 16:52 h2t.py
   23678 2010-09-30 16:39 t.html

here's what happens:

 $ python h2t.py t.html /tmp/t.txt
 HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
 Traceback (most recent call last):
   File "h2t.py", line 31, in <module>
     text = html2text(f_html.read())
   File "h2t.py", line 23, in html2text
     te = TextExtractor(html)
   File "h2t.py", line 15, in __init__
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
     k = self.parse_starttag(i)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
     endpos = self.check_for_whole_start_tag(i)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
     self.error("malformed start tag")
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
     raise HTMLParseError(message, self.getpos())
 HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332
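
For illustration only, here is a minimal sketch of the kind of input involved. It is not the attached h2t.py (whose contents are not shown here) and not the proposed patch. It assumes Python 3's html.parser module, where script contents are handed to handle_data() as raw text; the class name ScriptAwareExtractor, the helper parse() and the SAMPLE string are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class ScriptAwareExtractor(HTMLParser):
    """Hypothetical helper: records start tags and collects character data."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Called only for real start tags, not for '<' inside <script>.
        self.start_tags.append(tag)

    def handle_data(self, data):
        # Script contents arrive here as plain text (possibly in pieces).
        self.chunks.append(data)


def parse(html):
    """Feed a whole document and return (start tags, concatenated data)."""
    parser = ScriptAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.start_tags, "".join(parser.chunks)


# Assumed sample input: a '<' comparison inside a script element.
SAMPLE = "<p>hi</p><script>if (a<b) { f(); }</script>"

start_tags, text = parse(SAMPLE)
print(start_tags)
print(text)
```

With a parser that treats script contents as raw text, the '<b' in the comparison is not reported as a start tag, and the JavaScript source comes back unchanged as data.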

I have a suggested patch that fixes this problem; I will attach it soon.

-- yotam

nosy: +yotam
Added file: http://bugs.python.org/file19072/lt-in-script-example.tgz

Python tracker <report at bugs.python.org>
