[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )
report at bugs.python.org
Wed Dec 22 17:44:50 CET 2010
Martin Potthast <martin.potthast at googlemail.com> added the comment:
Agreed. Here's a patch for HTMLParser. That was easy enough.
With regard to tests, there already seems to be one called test_malformatted_charref in test_htmlparser.py. However, that test exercises the whole parser rather than HTMLParser.unescape() on its own.
At the same time, HTMLParser.unescape() has the following comment:
"# Internal -- helper to remove special character quoting"
It appears the syntax check is already done on line 168, but since unescape() is publicly visible, I'd say it should be able to handle all kinds of malformed input, despite that comment. Maybe the comment should be removed.
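
To illustrate the behaviour argued for here, the following is a minimal sketch of an unescape helper that decodes well-formed references and leaves malformed ones such as &#hearts; untouched instead of raising. It is not the attached patch: the name tolerant_unescape and the regular expression are assumptions made for this example, and it targets Python 3 (html.entities) rather than the Python 2 HTMLParser module discussed in this issue.

```python
import re
from html.entities import name2codepoint

# Matches decimal (&#9829;), hexadecimal (&#x2665;) and named (&hearts;)
# references. "&#hearts;" matches none of the three alternatives.
_CHARREF = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);")


def tolerant_unescape(text):
    """Hypothetical helper: decode valid references, keep everything else."""

    def replace(match):
        ref = match.group(1)
        try:
            if ref[0] == "#":
                if ref[1] in "xX":
                    return chr(int(ref[2:], 16))
                return chr(int(ref[1:]))
            return chr(name2codepoint[ref])
        except (KeyError, ValueError, OverflowError):
            # Unknown entity name or code point out of range:
            # return the original text unchanged.
            return match.group(0)

    return _CHARREF.sub(replace, text)
```

With this sketch, tolerant_unescape("&#hearts;") returns the input unchanged, while "&hearts;", "&#9829;" and "&#x2665;" all decode to the same heart character.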
I'm not entirely sure how to write the test properly, since it doesn't fit into the framework provided by test_htmlparser.py; and unfortunately, my time is rather short at the moment.
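
One possible shape for such a test is a small standalone unittest case that calls the unescape function directly instead of feeding markup through the parser. The sketch below is an assumption, not part of the patch: the class and method names are made up, and it uses html.unescape() from later Python 3 releases as a stand-in for HTMLParser.unescape().

```python
import html
import unittest


class UnescapeMalformedTest(unittest.TestCase):
    """Hypothetical test case exercising unescape in isolation."""

    def test_malformed_numeric_reference_is_left_alone(self):
        # "&#hearts;" is neither a valid numeric nor a valid named reference.
        self.assertEqual(html.unescape("&#hearts;"), "&#hearts;")

    def test_well_formed_references_still_decode(self):
        self.assertEqual(
            html.unescape("&hearts; &#9829; &#x2665;"),
            "\u2665 \u2665 \u2665",
        )
```

The case can be run with the standard unittest runner, for example through python -m unittest.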
Added file: http://bugs.python.org/file20141/HTMLParser.py.diff
Python tracker <report at bugs.python.org>