[Tutor] Beautiful Soup / Unicode problem?
Kent Johnson
kent37 at tds.net
Fri Aug 26 13:21:31 CEST 2005
grouchy wrote:
> Hi,
>
> I'm having bang-my-head-against-a-wall moments trying to figure all of this out.
>
> >>> from BeautifulSoup import BeautifulSoup
> >>> import re, urllib
> >>> file = urllib.urlopen("http://www.google.com/search?q=beautifulsoup")
> >>> file = file.read().decode("utf-8")
> >>> soup = BeautifulSoup(file)
> >>> results = soup('p','g')
> >>> x = results[1].a.renderContents()
> >>> type(x)
>
> <type 'unicode'>
>
> >>> print x
>
> Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping <b>...</b>
>
> So far so good. But what I really want is just the text, so I try
> something like:
>
>
> >>> y = results[1].a.fetchText(re.compile('.+'))
>
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
>   File "BeautifulSoup.py", line 466, in fetchText
>     return self.fetch(recursive=recursive, text=text, limit=limit)
>   File "BeautifulSoup.py", line 492, in fetch
>     return self._fetch(name, attrs, text, limit, generator)
>   File "BeautifulSoup.py", line 194, in _fetch
>     if self._matches(i, text):
>   File "BeautifulSoup.py", line 252, in _matches
>     chunk = str(chunk)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 26: ordinal not in range(128)
>
> Is this a bug? Come to think of it, I'm not even sure how printing x
> worked, since it printed non-ascii characters.
This is the first question in the BeautifulSoup FAQ at http://www.crummy.com/software/BeautifulSoup/FAQ.html
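Unfortunately the author of BS considers this a problem with your Python installation! It isn't: the traceback shows BeautifulSoup calling str(chunk) on a unicode string, and str() implicitly encodes with Python's default encoding, which is ascii, so it fails on any non-ascii character such as u'\xbb'. Printing x presumably worked because print encodes with your console's encoding rather than the default encoding. So it seems he doesn't have a good understanding of Python and Unicode. (OK, I can forgive him that; I think there are only a handful of people who really do understand it completely.)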
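Here is a minimal sketch of what goes wrong. In Python 2, str() on a unicode string behaves like an explicit .encode() with the default encoding, normally 'ascii'. The sample string is made up for illustration, and the encode calls are written out explicitly so the snippet behaves the same on any Python version:

```python
# Made-up sample text containing u'\xbb' (the right-pointing double angle quote).
text = u'Postneo 2.0 \xbb Blog Archive'

# Python 2's str(text) is equivalent to text.encode('ascii') when the
# default encoding is 'ascii', so it fails on the non-ascii character.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly with a codec that can represent the character works.
utf8_bytes = text.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')
```

The failure depends only on which codec is used for the conversion, not on anything in the Python installation.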
The first fix given in the FAQ doesn't work. The second fix works, but it is not a good idea to change the default encoding for your whole Python install. There is a hack you can use to change the default encoding for just one program. The reload is needed because site.py removes sys.setdefaultencoding at startup. At the top of your program put
import sys; reload(sys); sys.setdefaultencoding('utf-8')
This seems to fix the problem you are having.
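If you would rather not touch the default encoding at all, one possible workaround is to take the unicode string you already get from renderContents() and strip the tags yourself, so that str() is never called. This is only a rough sketch and not something from the BeautifulSoup docs: strip_tags is a hypothetical helper, and the regex is crude (it will not cope with '>' inside attribute values, or with comments).

```python
import re

# Crude pattern: anything from '<' up to the next '>' is treated as a tag.
TAG_RE = re.compile(u'<[^>]+>')

def strip_tags(markup):
    """Return markup with tag-like spans removed; the input stays unicode."""
    return TAG_RE.sub(u'', markup)

# The value of x from the session above, written with escapes.
x = u'Matt Croydon::Postneo 2.0 \xbb Blog Archive \xbb Mobile Screen Scraping <b>...</b>'
plain = strip_tags(x)
```

Because everything stays unicode until you choose to encode it for output, the default encoding never comes into play.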
Kent