[ python-Bugs-1599325 ] htmlentitydefs.entitydefs assumes Latin-1 encoding

SourceForge.net noreply at sourceforge.net
Sun Nov 19 20:40:19 CET 2006


Bugs item #1599325, was opened at 2006-11-19 14:40
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599325&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Erik Demaine (edemaine)
Assigned to: Nobody/Anonymous (nobody)
Summary: htmlentitydefs.entitydefs assumes Latin-1 encoding

Initial Comment:
The code in htmlentitydefs.py that builds entitydefs uses chr() whenever the codepoint is <= 0xff.  The threshold should be <= 0x7f.
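
A minimal sketch of the logic described above, not the module's actual source. It assumes Python 3, where html.entities.name2codepoint stands in for the htmlentitydefs table and bytes stand in for Python 2 byte strings. The helper name build_entitydefs and its max_single_byte parameter are hypothetical, introduced only to compare the two thresholds.

```python
from html.entities import name2codepoint


def build_entitydefs(max_single_byte):
    """Map entity names to byte strings (hypothetical helper).

    Codepoints up to max_single_byte become one raw byte; anything
    above it becomes a numeric character reference such as b'&#160;'.
    """
    defs = {}
    for name, codepoint in name2codepoint.items():
        if codepoint <= max_single_byte:
            # One raw byte: only meaningful if the reader assumes an
            # encoding in which this byte value is this character.
            defs[name] = bytes([codepoint])
        else:
            defs[name] = ("&#%d;" % codepoint).encode("ascii")
    return defs


current = build_entitydefs(0xFF)   # threshold described in the report
proposed = build_entitydefs(0x7F)  # threshold the report asks for

print(current["nbsp"])   # raw byte 0xa0
print(proposed["nbsp"])  # numeric character reference
print(proposed["amp"])   # ASCII entities are unaffected
```

With the lower threshold, only ASCII entities such as amp, lt, gt and quot map to a raw byte; everything else falls back to a character reference that is valid regardless of the document encoding.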

As it currently stands, htmlentitydefs.entitydefs['nbsp'] == '\xa0'.  But this is only "true" in the Latin-1 encoding.  For example, in UTF-8, the same character (u'\xa0') would be encoded as '\xc2\xa0'.  While I think it is reasonable for entitydefs to use the ASCII codec for characters encodable in that codec (<= 0x7f), I do not think it is reasonable to assume Latin-1 encoding.
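
The encoding difference can be checked directly. This is a sketch in Python 3 syntax (an assumption; the report itself concerns Python 2), using only str.encode:

```python
nbsp = "\u00a0"  # NO-BREAK SPACE, the character behind &nbsp;

latin1_bytes = nbsp.encode("latin-1")  # a single byte
utf8_bytes = nbsp.encode("utf-8")      # two bytes

print(latin1_bytes, utf8_bytes)

# Characters <= 0x7f encode to the same single byte in ASCII,
# Latin-1 and UTF-8, which is why that range is safe.
amp = "&"
same_everywhere = (
    amp.encode("ascii") == amp.encode("latin-1") == amp.encode("utf-8")
)
print(same_everywhere)
```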

This issue affects sgmllib.SGMLParser, for example, when handle_entityref calls handle_data.  The passed data can be '\xa0', which handle_data is forced to assume is Latin-1, when the source string might be encoded otherwise.
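
A rough illustration of the effect on parsed data, without using sgmllib itself. It assumes Python 3 and a UTF-8 encoded source; the byte substitution below is a hand-written stand-in for what an entity-expanding parser would pass to handle_data, not the parser's real code.

```python
# UTF-8 encoded source text containing an entity reference.
source = "caf\u00e9&nbsp;au lait".encode("utf-8")

# Stand-in for entity expansion using the Latin-1 style value '\xa0'.
before, after = source.split(b"&nbsp;")
data = before + b"\xa0" + after

# The result is no longer valid UTF-8: 0xa0 is a lone continuation byte.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as Latin-1 succeeds but garbles the original UTF-8 text.
as_latin1 = data.decode("latin-1")

print(utf8_ok)
print(repr(as_latin1))
```

Neither decoding recovers the intended text, because the stream now mixes two encodings.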

----------------------------------------------------------------------


