[Tutor] Beautiful Soup / Unicode problem?
kent37 at tds.net
Fri Aug 26 13:21:31 CEST 2005
> I'm having bang-my-head-against-a-wall moments trying to figure all of this out.
> >>> import re, urllib
> >>> from BeautifulSoup import BeautifulSoup
> >>> file = urllib.urlopen("http://www.google.com/search?q=beautifulsoup")
> >>> file = file.read().decode("utf-8")
> >>> soup = BeautifulSoup(file)
> >>> results = soup('p','g')
> >>> x = results[0].a.renderContents()
> >>> type(x)
> <type 'unicode'>
> >>> print x
> Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping <b>...</b>
> So far so good. But what I really want is just the text, so I try
> something like:
> >>> y = results[0].a.fetchText(re.compile('.+'))
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
>   File "BeautifulSoup.py", line 466, in fetchText
>     return self.fetch(recursive=recursive, text=text, limit=limit)
>   File "BeautifulSoup.py", line 492, in fetch
>     return self._fetch(name, attrs, text, limit, generator)
>   File "BeautifulSoup.py", line 194, in _fetch
>     if self._matches(i, text):
>   File "BeautifulSoup.py", line 252, in _matches
>     chunk = str(chunk)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 26: ordinal not in range(128)
> Is this a bug? Come to think of it, I'm not even sure how printing x
> worked, since it printed non-ascii characters.
This is the first question in the BeautifulSoup FAQ at http://www.crummy.com/software/BeautifulSoup/FAQ.html
Unfortunately the author of BS considers this a problem with your Python installation! So it seems he doesn't have a good understanding of Python and Unicode. (OK, I can forgive him that, I think there are only a handful of people who really do understand it completely.)
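What actually goes wrong is visible in the last frame of your traceback: calling str() on a unicode object makes Python 2 encode it with the default codec, which is ascii, and u'\xbb' (the » character) has no ASCII encoding. Here is a minimal sketch of the same failure. The encode is written out explicitly so it behaves the same on any Python version, and the sample string is my own shortened stand-in for your scraped title, not the real data:

```python
# Illustrative stand-in for the scraped title; not the actual page content.
text = u"Postneo 2.0 \xbb Blog Archive"

# Python 2's str(text) performs this encode implicitly with the default codec.
try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)

# Naming a codec that can represent the character avoids the error.
encoded = text.encode("utf-8")
```

So the trouble is the implicit conversion inside BeautifulSoup, not anything you did with the data.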
The first fix given doesn't work. The second fix works but it is not a good idea to change the default encoding for your Python install. There is a hack you can use to change the default encoding just for one program; in your program put
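A minimal sketch of that per-program hack, assuming UTF-8 is the encoding you want (substitute your own if not). It only does anything on Python 2; the version check is an extra guard, added so the file still loads on an interpreter that has no reload() builtin:

```python
import sys

# Python 2 only: site.py deletes sys.setdefaultencoding at startup,
# and reload(sys) brings it back so it can be called once more.
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding("utf-8")  # assumed target encoding

# Check what implicit str()/unicode() conversions will now use.
default_encoding = sys.getdefaultencoding()
```

Put it at the very top of the program, before anything that might convert unicode to str.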
This seems to fix the problem you are having.