[ python-Bugs-1599325 ] htmlentitydefs.entitydefs assumes Latin-1 encoding

SourceForge.net noreply at sourceforge.net
Sun Nov 19 20:59:00 CET 2006


Bugs item #1599325, was opened at 2006-11-19 20:40
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1599325&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
>Status: Closed
>Resolution: Invalid
Priority: 5
Private: No
Submitted By: Erik Demaine (edemaine)
Assigned to: Nobody/Anonymous (nobody)
Summary: htmlentitydefs.entitydefs assumes Latin-1 encoding

Initial Comment:
The code in htmlentitydefs.py that sets entitydefs uses chr whenever the codepoint is <= 0xff.  The cutoff should instead be <= 0x7f.

As it currently stands, htmlentitydefs.entitydefs['nbsp'] == '\xa0'.  But this is only "true" in the Latin-1 encoding.  For example, in UTF-8, the same character (u'\xa0') would be encoded as '\xc2\xa0'.  While I think it is reasonable for entitydefs to use the ASCII codec for characters encodable in that codec (<= 0x7f), I do not think it is reasonable to assume Latin-1 encoding.

This issue affects sgmllib.SGMLParser, for example, when handle_entityref calls handle_data.  The passed data can be '\xa0', which handle_data is forced to assume is Latin-1, when the source string might be encoded otherwise.
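The encoding mismatch described above can be illustrated with a minimal sketch. It uses only built-in string methods; the variable names are illustrative and do not come from htmlentitydefs or sgmllib.

```python
# U+00A0 NO-BREAK SPACE, the character behind the 'nbsp' entity.
nbsp = u"\xa0"

# The same character yields different byte strings in different encodings.
latin1_bytes = nbsp.encode("latin-1")  # a single byte, 0xa0
utf8_bytes = nbsp.encode("utf-8")      # two bytes, 0xc2 0xa0

# The Latin-1 byte on its own is not valid UTF-8, so a handler that
# receives it while processing UTF-8 input cannot decode it.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

Characters up to 0x7f do not show this problem, since ASCII bytes are identical in Latin-1 and UTF-8.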

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2006-11-19 20:59

Message:

This is not a bug. The documentation specifies that entitydefs contains
Latin-1 byte strings, and many applications rely on that.

If you have different processing needs, you may want to use
htmlentitydefs.name2codepoint instead, or derive yet another table
automatically from it.

----------------------------------------------------------------------
