[ python-Bugs-1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.

Wed Feb 7 08:57:19 CET 2007

Bugs item #1651995, was opened at 2007-02-04 22:34
Message generated for change (Comment added) made by nagle
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1651995&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Nagle (nagle)
Assigned to: Nobody/Anonymous (nobody)
Summary: sgmllib _convert_ref UnicodeDecodeError exception, new in 2.

Initial Comment:
   I'm running a website page through BeautifulSoup.  It parses OK with Python 2.4, but Python 2.5 fails with an exception:

Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
    self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
  File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "./sitetruth/BeautifulSoup.py", line 973, in __init__
    self._feed()
  File "./sitetruth/BeautifulSoup.py", line 998, in _feed
    SGMLParser.feed(self, markup or "")
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "./sitetruth/BeautifulSoup.py", line 998, in _feed
    SGMLParser.feed(self, markup or "")
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)

    The code that's failing is in "_convert_ref", which is new in Python 2.5. That function wasn't present in 2.4.  I think the code is trying to handle single quotes inside of double quotes in HTML attributes, or something like that.

    To replicate, run

	http://www.bankofamerica.com
or
	http://www.gm.com

through BeautifulSoup.  

Something about this code doesn't like big companies. Web sites of smaller companies are going through OK.

----------------------------------------------------------------------

>Comment By: John Nagle (nagle)
Date: 2007-02-07 07:57

Message:
Logged In: YES 
user_id=5571
Originator: YES

Found the problem. In sgmllib.py for Python 2.5, convert_charref (the
code that handles numeric character references) assumes that ASCII
characters have values up to 255. The correct limit is 127, of course.

If a Unicode string is run through SGMLParser, and that string has a
numeric character reference in an attribute with a value between 128 and
255 (which is valid in Unicode), the value is converted with "chr",
creating a one-character byte string that is not valid ASCII.

Then, when the bad string is later converted to Unicode as the output is
assembled, the UnicodeDecodeError exception is raised. 
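
The failure mode described above can be sketched in a few lines. This is
only an illustration, written for Python 3, where the implicit str/unicode
coercion of Python 2 no longer exists and has to be emulated with an
explicit ASCII decode. The helper name py2_implicit_coercion and the sample
value 167 (0xa7, the byte named in the traceback) are assumptions made for
the example, not code from sgmllib.

```python
# Python 3 emulation of the Python 2 failure; all names are illustrative.

def py2_implicit_coercion(byte_string):
    """Emulate Python 2 mixing str with unicode: an implicit ASCII decode."""
    return byte_string.decode("ascii")


# A reference such as "&#167;" carries code point 167 (0xa7).
n = 167
one_byte = bytes([n])  # stands in for Python 2's chr(167), a 1-byte str

try:
    py2_implicit_coercion(one_byte)
    error = None
except UnicodeDecodeError as exc:
    # Same exception type as in the traceback: 0xa7 is outside range(128).
    error = exc

print(repr(error))
```

Values up to 127 pass through the same path without error, which is why
only pages containing such references are affected.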

So the fix is to change 255 to 127 in convert_charref in sgmllib.py,
as shown below.  This forces characters above 127 to be expressed with
escape sequences.  Please patch accordingly.  Thanks.

def convert_charref(self, name):
    """Convert character reference, may be overridden."""
    try:
        n = int(name)
    except ValueError:
        return
    if not 0 <= n <= 127:  # ASCII ends at 127, not 255
        return
    return self.convert_codepoint(n)
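
To see what the lower limit does to an attribute value, here is a rough
standalone sketch in Python 3. It is not the sgmllib implementation: the
pattern CHARREF, the function expand_refs, and the limit parameter are
simplified stand-ins invented for the example, and chr() here returns a
Python 3 text string rather than a Python 2 byte string.

```python
import re

# Simplified, hypothetical pattern for decimal character references.
CHARREF = re.compile(r"&#([0-9]+);")


def convert_charref(name, limit=127):
    """Return the character for a reference, or None if out of range."""
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= limit:
        return None
    return chr(n)


def expand_refs(attrvalue, limit=127):
    """Expand references up to `limit`; leave the others as escapes."""
    def _sub(match):
        # A None result means "not handled": keep the original escape text.
        return convert_charref(match.group(1), limit) or match.group(0)
    return CHARREF.sub(_sub, attrvalue)


expanded = expand_refs("A&#38;B &#167;1")
print(expanded)
```

With the limit at 127, the reference to 38 is expanded while the
reference to 167 stays in the text as an escape sequence, so no non-ASCII
byte string is ever created.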

----------------------------------------------------------------------

Comment By: wrstl prmpft (wrstlprmpft)
Date: 2007-02-05 07:16

Message:
Logged In: YES 
user_id=801589
Originator: NO

I had a similar problem recently and did not have time to file a
bug-report. Thanks for doing that.

The problem is the code that handles entity and character references in
SGMLParser.parse_starttag. It does not seem to be careful about
unicode/str issues.
(But maybe BeautifulSoup needs to tell it to?)

My quick'n'dirty workaround was to remove the offending character
reference from the page before feeding it to BeautifulSoup::

  text = text.replace('&#174;', '')  # remove the registered sign (R) character reference
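
The same workaround could be generalized to every decimal reference in the
128-255 range instead of a single hard-coded one. The following is a
minimal sketch under that assumption; the function name
strip_high_charrefs and the pattern are hypothetical, and dropping the
characters loses information, so it is only suitable where those symbols
do not matter.

```python
import re

# Hypothetical helper pattern: decimal character references only.
HIGH_CHARREF = re.compile(r"&#([0-9]+);")


def strip_high_charrefs(text):
    """Remove decimal references whose value lies in 128..255."""
    def _sub(match):
        n = int(match.group(1))
        # Only the range that triggers the error is dropped.
        return "" if 128 <= n <= 255 else match.group(0)
    return HIGH_CHARREF.sub(_sub, text)


cleaned = strip_high_charrefs("x&#174;y&#65;&#8364;")
print(cleaned)
```

References below 128 and above 255 are left untouched, since they are not
the ones that trigger the error described in this report.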

cheers,
stefan

----------------------------------------------------------------------
