[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

Fri Oct 3 13:09:15 CEST 2008

Simon Cross <hodgestar at gmail.com> added the comment:

I've tracked down the cause to the .unescape(...) method in HTMLParser.
The replaceEntities function passed to re.sub() always returns a unicode
character, even when matching string s is a byte string. Changing line
383 to:

  return self.entitydefs[s].encode("utf-8")

makes the test pass. Unfortunately this is obviously not a viable
solution in the general case. The problem is that there is no way to
know what character set to encode in without knowing both the HTTP
headers (which are not available to HTMLParser) and looking at the XML
and HTML headers.

Python 3.0 implicitly rejects non-unicode strings right at the start of
html.parser.HTMLParser.feed(...) by adding '' to the data passed in.

Given Python 3.0's behaviour, the docs should perhaps be updated to say
HTMLParser does not support non-unicode strings? If it should support
byte strings, we'll have to figure out how to handle encoded entity issues.

It's a bit weird that character and entity references outside
tags/attributes result in calls to .entityref(...) and .charref(...)
while those inside get unescape called automatically. Don't really see
what can be done about that though.

----------
versions: +Python 2.7

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3932>
_______________________________________