[ python-Bugs-1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.
SourceForge.net
noreply at sourceforge.net
Wed Feb 7 08:57:19 CET 2007
Bugs item #1651995, was opened at 2007-02-04 22:34
Message generated for change (Comment added) made by nagle
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1651995&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Nagle (nagle)
Assigned to: Nobody/Anonymous (nobody)
Summary: sgmllib _convert_ref UnicodeDecodeError exception, new in 2.
Initial Comment:
I'm running a website page through BeautifulSoup. It parses OK with Python 2.4, but Python 2.5 fails with an exception:
Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
    self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
  File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "./sitetruth/BeautifulSoup.py", line 973, in __init__
    self._feed()
  File "./sitetruth/BeautifulSoup.py", line 998, in _feed
    SGMLParser.feed(self, markup or "")
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "./sitetruth/BeautifulSoup.py", line 998, in _feed
    SGMLParser.feed(self, markup or "")
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)
The code that's failing is in "_convert_ref", which is new in Python 2.5. That function wasn't present in 2.4. I think the code is trying to handle single quotes inside of double quotes in HTML attributes, or something like that.
To replicate, run
http://www.bankofamerica.com
or
http://www.gm.com
through BeautifulSoup.
Something about this code doesn't like big companies. Web sites of smaller companies are going through OK.
----------------------------------------------------------------------
>Comment By: John Nagle (nagle)
Date: 2007-02-07 07:57
Message:
Logged In: YES
user_id=5571
Originator: YES
Found the problem. In sgmllib.py for Python 2.5, in convert_charref, the
code for handling character escapes assumes that ASCII characters have
values up to 255.
But the correct limit is 127, of course.
If a Unicode string is run through SGMLParser, and that string has a
character reference in an attribute with a value between 128 and 255,
which is valid in Unicode, the value is passed through "chr", creating a
one-character string that is not valid ASCII.
Then, when the bad string is later converted to Unicode as the output is
assembled, the UnicodeDecodeError exception is raised.
So the fix is to change 255 to 127 in convert_charref in sgmllib.py,
as shown below. This forces characters above 127 to be expressed with
escape sequences. Please patch accordingly. Thanks.
    def convert_charref(self, name):
        """Convert character reference, may be overridden."""
        try:
            n = int(name)
        except ValueError:
            return
        if not 0 <= n <= 127 : # ASCII ends at 127, not 255
            return
        return self.convert_codepoint(n)
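
The failure mode described in this comment can be reproduced in isolation. What follows is a minimal sketch written for Python 3, not the sgmllib source. Python 2's chr(n) returned a one-byte str, which is modelled here with bytes([n]), and the implicit ASCII decode that Python 2 performed when a str met a unicode object is written out as an explicit .decode("ascii"). The names py2_chr, convert_charref_sketch and join_with_unicode, and the limit parameter, are hypothetical and exist only for illustration.

```python
def py2_chr(n):
    """Model Python 2's chr(): a one-byte string for 0 <= n <= 255."""
    return bytes([n])


def convert_charref_sketch(name, limit):
    """Mimic the range check in convert_charref with a configurable limit.

    Returns a byte string, or None when the reference is left unconverted.
    """
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= limit:
        return None
    return py2_chr(n)


def join_with_unicode(prefix, converted):
    """Model Python 2's implicit ASCII decode when str meets unicode."""
    return prefix + converted.decode("ascii")


# With a limit of 255, the reference 167 becomes the single byte 0xa7,
# which cannot be decoded as ASCII when the output is assembled.
bad = convert_charref_sketch("167", limit=255)
try:
    join_with_unicode("x", bad)
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# With a limit of 127, the same reference is left unconverted,
# while a plain ASCII reference such as 65 still goes through.
skipped = convert_charref_sketch("167", limit=127)
ascii_ok = join_with_unicode("x", convert_charref_sketch("65", limit=127))
```

Under these assumptions, the byte 0xa7 seen in the traceback corresponds to a character reference with value 167, and lowering the limit leaves such references for the caller to handle instead of producing a non-ASCII byte string.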
----------------------------------------------------------------------
Comment By: wrstl prmpft (wrstlprmpft)
Date: 2007-02-05 07:16
Message:
Logged In: YES
user_id=801589
Originator: NO
I had a similar problem recently and did not have time to file a
bug-report. Thanks for doing that.
The problem is the code that handles entity and character references in
SGMLParser.parse_starttag. Seems that it is not careful about unicode/str
issues.
(But maybe BeautifulSoup needs to tell it to?)
My quick'n'dirty workaround was to remove the offending char-entity from
the website before feeding it to BeautifulSoup::
    text = text.replace('&#174;', '') # remove rights reserved sign entity
cheers,
stefan
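
The workaround above removes one specific entity. A slightly more general sketch, assuming the goal is to drop every decimal character reference in the 128 to 255 range before the text reaches the parser, could look like the following. The name strip_high_charrefs and the sample markup are hypothetical, and the approach discards those characters rather than converting them, so it is a stopgap rather than a fix.

```python
import re

# Matches decimal character references such as "&#174;".
CHARREF = re.compile(r"&#(\d+);")


def strip_high_charrefs(text):
    """Remove decimal character references with values from 128 to 255."""
    def replace(match):
        n = int(match.group(1))
        # Drop only the range that triggers the problem; keep the rest.
        return "" if 128 <= n <= 255 else match.group(0)
    return CHARREF.sub(replace, text)


cleaned = strip_high_charrefs('<a title="Acme&#174; &#65;&#8364;">')
```

References below 128 and above 255 are left untouched, since only the 128 to 255 range is affected by the check in convert_charref.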
----------------------------------------------------------------------