[Python-bugs-list] [ python-Bugs-803422 ] sgmllib doesn't support hex or Unicode character references
SourceForge.net
noreply at sourceforge.net
Tue Sep 9 15:00:34 EDT 2003
Bugs item #803422, was opened at 2003-09-09 15:53
Message generated for change (Comment added) made by aaronsw
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=803422&group_id=5470
Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Aaron Swartz (aaronsw)
Assigned to: Nobody/Anonymous (nobody)
Summary: sgmllib doesn't support hex or Unicode character references
Initial Comment:
sgmllib supports neither hexadecimal character references nor
Unicode characters, both of which are commonly seen on web pages.
The following replacements fix both problems.
charref = re.compile('&#([0-9a-fA-F]+)[^0-9a-fA-F]')

def handle_charref(self, ref):
    try:
        if ref[0] == 'x' or ref[0] == 'X':
            m = int(ref[1:], 16)
        else:
            m = int(ref)
        self.handle_data(unichr(m).encode('utf-8'))
    except ValueError:
        self.unknown_charref(ref)
----------------------------------------------------------------------
>Comment By: Aaron Swartz (aaronsw)
Date: 2003-09-09 16:00
Message:
Logged In: YES
user_id=122141
Oops, that should be:
charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')
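
For illustration only, here is a minimal standalone sketch of how the corrected pattern and the decoding logic fit together. It is not part of the proposed patch: it assumes Python 3 (so chr stands in for unichr, and no UTF-8 encoding step is shown), and decode_charref is a hypothetical helper name rather than anything in sgmllib.

```python
import re

# Corrected pattern from the comment above: the first character of the
# reference may be 'x' or 'X' (hex marker) or a digit; the final character
# class matches whatever terminates the reference (typically ';').
charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')


def decode_charref(ref):
    """Hypothetical helper: return the character named by a reference
    body such as '233' or 'xE9', or None if it cannot be decoded."""
    try:
        if ref[:1] in ('x', 'X'):
            n = int(ref[1:], 16)   # hexadecimal form, e.g. 'xE9'
        else:
            n = int(ref)           # decimal form, e.g. '233'
        return chr(n)              # Python 3 equivalent of unichr(n)
    except (ValueError, OverflowError):
        # Malformed digits or a code point outside the valid range.
        return None


# The corrected pattern captures the leading 'x' so the hex branch can see it.
match = charref.match('&#xE9;')
decoded = decode_charref(match.group(1))
```

Returning None here plays the role that calling unknown_charref plays in the method above; a real parser subclass would keep that call instead.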
----------------------------------------------------------------------