[Tutor] Beautiful Soup / Unicode problem?

Kent Johnson kent37 at tds.net
Fri Aug 26 13:21:31 CEST 2005

grouchy wrote:
> Hi,
> I'm having bang-my-head-against-a-wall moments trying to figure all of this out.
> >>> import re, urllib
> >>> from BeautifulSoup import BeautifulSoup
> >>> file = urllib.urlopen("http://www.google.com/search?q=beautifulsoup")
> >>> file = file.read().decode("utf-8")
> >>> soup = BeautifulSoup(file)
> >>> results = soup('p','g')
> >>> x = results[1].a.renderContents()
> >>> type(x)
> <type 'unicode'>
> >>> print x
> Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping <b>...</b>
> So far so good.  But what I really want is just the text, so I try
> something like:
> >>> y = results[1].a.fetchText(re.compile('.+'))
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
>   File "BeautifulSoup.py", line 466, in fetchText
>     return self.fetch(recursive=recursive, text=text, limit=limit)
>   File "BeautifulSoup.py", line 492, in fetch
>     return self._fetch(name, attrs, text, limit, generator)
>   File "BeautifulSoup.py", line 194, in _fetch
>     if self._matches(i, text):
>   File "BeautifulSoup.py", line 252, in _matches
>     chunk = str(chunk)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in
> position 26: ordinal not in range(128)
> Is this a bug?  Come to think of it, I'm not even sure how printing x
> worked, since it printed non-ascii characters.

This is the first question in the BeautifulSoup FAQ at http://www.crummy.com/software/BeautifulSoup/FAQ.html

Unfortunately the author of BS considers this a problem with your Python installation! So it seems he doesn't have a good understanding of Python and Unicode. (OK, I can forgive him that, I think there are only a handful of people who really do understand it completely.)
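
For what it's worth, the traceback comes from the str(chunk) call: in Python 2, calling str() on a unicode object encodes it with the default encoding, which is ascii unless you change it. Here is a minimal sketch of the same failure with the encoding step written out explicitly. The sample string is made up for illustration, and encode() with an explicit codec behaves the same way in later Pythons.

```python
# Sketch: str(chunk) on a unicode object in Python 2 behaves like
# chunk.encode('ascii'), because 'ascii' is the default encoding.
chunk = u'Postneo 2.0 \xbb Blog Archive'   # u'\xbb' is the character from the traceback

try:
    chunk.encode('ascii')          # what str(chunk) does implicitly
    failed = False
except UnicodeEncodeError:
    failed = True                  # same kind of error as in the traceback

# Naming a codec that can represent the character works fine.
encoded = chunk.encode('utf-8')
```

As for why print x worked: most likely print encoded the string with the console's encoding (sys.stdout.encoding) rather than the default encoding, so the non-ascii characters made it through.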

The first fix given in the FAQ doesn't work. The second fix works, but changing the default encoding for your whole Python install is not a good idea. There is a hack you can use to change the default encoding for just one program. site.py deletes sys.setdefaultencoding at startup and reloading sys brings it back, so in your program put
  import sys; reload(sys); sys.setdefaultencoding('utf-8')

This seems to fix the problem you are having.
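
If you would rather not touch the default encoding at all, another option is to keep working with the unicode string you already got from renderContents() and strip the markup yourself. This is only a rough sketch, assuming the link text contains nothing but simple tags like <b>. strip_tags is a made-up helper, not a BeautifulSoup function, and a regex like this is not a general HTML parser.

```python
import re

# Hypothetical helper, not part of BeautifulSoup: drop anything that
# looks like a tag from an already-decoded unicode string.
TAG = re.compile(u'<[^>]+>')

def strip_tags(markup):
    return TAG.sub(u'', markup)

# Sample modelled on the renderContents() output quoted above.
x = u'Matt Croydon::Postneo 2.0 \xbb Blog Archive \xbb Mobile Screen Scraping <b>...</b>'
text = strip_tags(x)
```

The result stays unicode the whole time, so nothing gets implicitly encoded as ascii. When you need bytes (to write a file, say), encode explicitly with text.encode('utf-8').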

