Unicode problem in BeautifulSoup; worked in Python 2.4, fails in Python 2.5.
Mizipzor
mizipzor at gmail.com
Sun Feb 4 17:47:54 EST 2007
On Feb 4, 11:39 pm, John Nagle <n... at animats.com> wrote:
> I'm running a website page through BeautifulSoup. It parses OK
> with Python 2.4, but Python 2.5 fails with an exception:
>
> Traceback (most recent call last):
> File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
> self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
> File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
> BeautifulStoneSoup.__init__(self, *args, **kwargs)
> File "./sitetruth/BeautifulSoup.py", line 973, in __init__
> self._feed()
> File "./sitetruth/BeautifulSoup.py", line 998, in _feed
> SGMLParser.feed(self, markup or "")
> File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
> self.goahead(0)
> File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
> k = self.parse_starttag(i)
> File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
> self.finish_starttag(tag, attrs)
> File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
> self.handle_starttag(tag, method, attrs)
> File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
> method(attrs)
> File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
> self._feed(self.declaredHTMLEncoding)
> File "./sitetruth/BeautifulSoup.py", line 998, in _feed
> SGMLParser.feed(self, markup or "")
> File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
> self.goahead(0)
> File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
> k = self.parse_starttag(i)
> File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
> self._convert_ref, attrvalue)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal
> not in range(128)
>
> The code that's failing is in "_convert_ref", which is new in Python 2.5.
> That function wasn't present in 2.4. I think the code is trying to
> handle single quotes inside of double quotes, or something like that.
>
> To replicate, run
>
> http://www.bankofamerica.com
> or
> http://www.gm.com
>
> through BeautifulSoup.
>
> Something about this code doesn't like big companies. Web sites of smaller
> companies are going through OK.
>
> Also reported as a bug:
>
> [ 1651995 ] sgmllib _convert_ref UnicodeDecodeError exception, new in 2.5
>
> John Nagle
I think this post got rather misplaced, hehe.
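
For what it's worth, the last line of the quoted traceback is the generic error Python raises whenever a byte outside the ASCII range is decoded with the ascii codec. The snippet below is only a minimal sketch of that error class, not of what sgmllib does internally: the single-byte input and the latin-1 codec are my own assumptions for illustration, and the real pages may well use a different encoding.

```python
# Minimal sketch: reproduce the kind of error shown in the traceback.
# Assumption: a lone 0xa7 byte stands in for the offending attribute value.
raw = b"\xa7"

try:
    # Decoding with the ascii codec fails for any byte >= 0x80.
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding with an explicit codec succeeds. latin-1 is an assumption here;
# the correct codec depends on what the page actually declares.
text = raw.decode("latin-1")

print(message)
print(repr(text))
```

If that is what is going on, decoding the page to unicode with a known encoding before handing it to the parser might sidestep the implicit ascii decode, though I have not tried that against the sites mentioned above.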