[ python-Bugs-1599325 ] htmlentitydefs.entitydefs assumes Latin-1 encoding
SourceForge.net
noreply at sourceforge.net
Sun Nov 19 20:40:19 CET 2006
Bugs item #1599325, was opened at 2006-11-19 14:40
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599325&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Erik Demaine (edemaine)
Assigned to: Nobody/Anonymous (nobody)
Summary: htmlentitydefs.entitydefs assumes Latin-1 encoding
Initial Comment:
The code in htmlentitydefs.py that sets entitydefs uses chr whenever the codepoint is <= 0xff. This should be <= 0x7f.
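A rough sketch of the threshold in question, written for Python 3 with bytes standing in for the 8-bit str of Python 2. The function build_entitydefs, its limit parameter, and the three-entry sample table are hypothetical names for illustration, not the module's actual code:

```python
def build_entitydefs(name2codepoint, limit):
    """Map entity names to a single byte when codepoint <= limit,
    otherwise to a numeric character reference (hypothetical helper)."""
    defs = {}
    for name, codepoint in name2codepoint.items():
        if codepoint <= limit:
            # Single-byte replacement, comparable to chr(codepoint) in Python 2
            defs[name] = bytes([codepoint])
        else:
            # Encoding-neutral fallback: a numeric character reference
            defs[name] = ('&#%d;' % codepoint).encode('ascii')
    return defs

# Illustrative subset of the entity table (assumed sample, not the full table)
sample = {'amp': 0x26, 'nbsp': 0xa0, 'alpha': 0x3b1}

current = build_entitydefs(sample, 0xff)   # behaviour described in this report
proposed = build_entitydefs(sample, 0x7f)  # behaviour suggested in this report

print(current['nbsp'])   # b'\xa0'
print(proposed['nbsp'])  # b'&#160;'
```

With the lower limit, only ASCII codepoints become raw bytes; everything else stays a character reference that is valid in any ASCII-compatible encoding.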
As it currently stands, htmlentitydefs.entitydefs['nbsp'] == '\xa0'. But this is only "true" in the Latin-1 encoding. For example, in UTF-8, the same character (u'\xa0') would be encoded as '\xc2\xa0'. While I think it is reasonable for entitydefs to use the ASCII codec for characters encodable in that codec (<= 0x7f), I do not think it is reasonable to assume Latin-1 encoding.
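The difference between the two encodings can be checked directly. This is a small illustration in Python 3 syntax (an assumption, since the report concerns Python 2 byte strings); the variable names are arbitrary:

```python
# U+00A0 NO-BREAK SPACE, the character named by the 'nbsp' entity
nbsp = '\u00a0'

latin1_bytes = nbsp.encode('latin-1')  # one byte
utf8_bytes = nbsp.encode('utf-8')      # two bytes

print(latin1_bytes)  # b'\xa0'
print(utf8_bytes)    # b'\xc2\xa0'
```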
This issue affects sgmllib.SGMLParser, for example, when handle_entityref calls handle_data. The passed data can be '\xa0', which handle_data is forced to assume is Latin-1, when the source string might be encoded otherwise.
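To illustrate the consequence for a parser's data handler, the sketch below mimics what happens when a Latin-1 byte from an entity replacement is appended to data that is otherwise UTF-8. It does not use sgmllib itself; the variable names and the sample text are assumptions for illustration, in Python 3 syntax:

```python
# Document text already encoded as UTF-8 ('caf' followed by U+00E9)
utf8_text = 'caf\u00e9'.encode('utf-8')

# The single byte that the entity table supplies for &nbsp;
entity_replacement = b'\xa0'

# What a handler would accumulate if it simply concatenates the pieces
combined = utf8_text + entity_replacement

try:
    combined.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    # A lone 0xa0 byte is not a valid UTF-8 sequence
    decoded_ok = False

print(decoded_ok)  # False
```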
----------------------------------------------------------------------