[ python-Bugs-1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.

SourceForge.net noreply at sourceforge.net
Sun Feb 4 23:34:48 CET 2007


Bugs item #1651995, was opened at 2007-02-04 22:34
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1651995&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Nagle (nagle)
Assigned to: Nobody/Anonymous (nobody)
Summary: sgmllib _convert_ref UnicodeDecodeError exception, new in 2.

Initial Comment:
   I'm running a web page through BeautifulSoup.  It parses OK with Python 2.4, but with Python 2.5 it fails with an exception:

Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
    self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
  File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "./sitetruth/BeautifulSoup.py", line 973, in __init__
    self._feed()
  File "./sitetruth/BeautifulSoup.py", line 998, in _feed
    SGMLParser.feed(self, markup or "")
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "./sitetruth/BeautifulSoup.py", line 998, in _feed
    SGMLParser.feed(self, markup or "")
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)

    The failing code is in "_convert_ref", which is new in Python 2.5; that function wasn't present in 2.4.  I think the code is trying to handle single quotes inside double quotes in HTML attributes, or something like that.
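
    The error text at the bottom of the traceback can be reproduced in isolation, which may help narrow down the cause. The sketch below is written for Python 3 so it runs on a current interpreter, and it is not the sgmllib code. It only shows that decoding the single byte 0xa7 with the ASCII codec produces the same message. The helper names are made up for illustration. The idea that the byte comes from a numeric character reference such as "&#167;" in an attribute value is an assumption; this report does not confirm it.

```python
# Sketch only: reproduces the error text from the traceback in isolation.
# Assumption (not confirmed by the report): the byte 0xa7 came from a
# numeric character reference such as "&#167;" inside an attribute value.


def decode_as_ascii(raw):
    """Decode bytes with the ASCII codec, as Python 2 does implicitly
    when a byte string is combined with a unicode string."""
    return raw.decode("ascii")


def try_decode(raw):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return decode_as_ascii(raw)
    except UnicodeDecodeError as exc:
        return str(exc)


# Hypothetical inputs: one plain ASCII value, one containing byte 0xa7.
plain = try_decode(b"abc")
message = try_decode(b"\xa7")

print(plain)
print(message)
```

    If the assumption holds, any page whose attribute values contain a character reference above 127 would hit this path once the markup has been converted to unicode, regardless of the size of the company behind the site.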

    To replicate, run

	http://www.bankofamerica.com
or
	http://www.gm.com

through BeautifulSoup.  

Something about this code doesn't like big companies. Web sites of smaller companies are going through OK.

----------------------------------------------------------------------
