[issue3932] HTMLParser cannot handle '&' and non-ascii characters in attribute names

Sun Nov 6 23:06:56 CET 2011

Ezio Melotti <ezio.melotti at gmail.com> added the comment:

I'm not sure what the best solution is here.

unescape uses a regex with replaceEntities as a callback to replace the entities in attribute values.
The problem is that replaceEntities currently returns unicode, and if unescape receives a str, an automatic coercion to unicode happens and an error is raised whenever the str is non-ascii.
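To make the mechanism concrete, here is a minimal sketch of the regex-plus-callback approach. It is not the actual HTMLParser code: it is written for Python 3, handles named entities only, and the names ENTITY_RE, replace_entities and coerce_like_python2 are made up for illustration. Python 2's implicit str -> unicode coercion is approximated by an explicit ASCII decode, which is where the error comes from.

```python
import re
from html.entities import name2codepoint  # htmlentitydefs in Python 2

# Simplified pattern: named entities only (no &#123; or &#x7b; forms).
ENTITY_RE = re.compile(r"&(\w+);")


def replace_entities(match):
    """Callback for re.sub: always returns text (unicode in Python 2 terms)."""
    name = match.group(1)
    if name in name2codepoint:
        return chr(name2codepoint[name])
    return match.group(0)  # unknown entity: leave it untouched


def unescape(value):
    """Replace known named entities in an attribute value."""
    return ENTITY_RE.sub(replace_entities, value)


def coerce_like_python2(raw):
    """Hypothetical stand-in for Python 2's implicit coercion.

    When a byte string meets unicode, Python 2 decodes the byte string
    as ASCII, so any non-ASCII byte raises UnicodeDecodeError.
    """
    return raw.decode("ascii")


print(unescape("a &amp; b"))
try:
    unescape(coerce_like_python2(b"caf\xc3\xa9 &amp; tea"))
except UnicodeDecodeError as exc:
    print("coercion failed:", exc.reason)
```

The second call fails before any entity is replaced, which matches the behaviour described above: the entity itself is fine, the non-ascii bytes around it are not.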

The possible solutions are:
 1) Document the status quo (i.e. replaceEntities always returns unicode, and an error is raised whenever a string that contains non-ascii chars is passed);
 2) Change replaceEntities to return str only for ascii chars (as the patch proposed by Zbigniew does).  This works as long as the entity resolves to an ascii character, but keeps failing for the other cases.

The first option is cleaner, and means that if you want to parse something you should always use unicode, otherwise it might fail ("In the face of ambiguity, refuse the temptation to guess").
The second option might allow you to parse a few more documents without converting them to unicode, but only if you are lucky (i.e. you don't get any unicode mixed with non-ascii str).  If most of the entities in attributes resolve to ascii (e.g. &quot; &amp; &apos; &gt; &lt;), it might be more practical to return str and avoid unnecessary errors, while still adding a note in the documentation that passing unicode is better.
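For comparison, the second option could look roughly like the sketch below. This is not Zbigniew's patch, only an approximation in Python 3 terms: bytes play the role of Python 2 str, and all names are hypothetical. One simplification to be aware of: in Python 2 a non-ascii entity would only fail when the surrounding str also contains non-ascii bytes, whereas the sketch raises unconditionally.

```python
import re
from html.entities import name2codepoint  # htmlentitydefs in Python 2

# Bytes pattern: \w matches ASCII word characters only.
BYTES_ENTITY_RE = re.compile(rb"&(\w+);")


def replace_entities_ascii_only(match):
    """Return a byte string when the entity resolves to an ASCII character."""
    name = match.group(1).decode("ascii")
    codepoint = name2codepoint.get(name)
    if codepoint is None:
        return match.group(0)      # unknown entity: leave it untouched
    if codepoint < 128:
        return bytes([codepoint])  # ASCII result: no unicode involved
    # Non-ASCII result: Python 2 would return unicode here, and mixing it
    # with non-ascii str would fail. The sketch simply raises (assumption).
    raise UnicodeError("entity &%s; does not resolve to ASCII" % name)


def unescape_bytes(value):
    """Replace ASCII-resolving named entities in a byte string."""
    return BYTES_ENTITY_RE.sub(replace_entities_ascii_only, value)


# Non-ascii bytes plus an ASCII entity: this case now works.
print(unescape_bytes(b"caf\xc3\xa9 &amp; tea"))

# An entity that resolves outside ASCII still fails.
try:
    unescape_bytes(b"caf\xc3\xa9 &eacute;")
except UnicodeError as exc:
    print("still failing:", exc)
```

So the common entities go through without any conversion, but the caller still cannot rely on arbitrary input working unless it is decoded to unicode first.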

----------
nosy: +ezio.melotti, r.david.murray
type:  -> behavior
versions:  -Python 2.6

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3932>
_______________________________________