[Tutor] Beautiful Soup / Unicode problem?

grouchy grouch at gmail.com
Thu Aug 25 20:24:38 CEST 2005


Hi,

I'm having bang-my-head-against-a-wall moments trying to figure all of this out.

A word of warning: this is the first time I've tried using Unicode or
Beautiful Soup, so if I'm being stupid, please forgive me.  I'm trying
to scrape results from Google with Beautiful Soup as a test case.
I've seen people recommend it here, so maybe somebody can recognize
what I'm doing wrong:

>>> import re
>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> file = urllib.urlopen("http://www.google.com/search?q=beautifulsoup")
>>> file = file.read().decode("utf-8")
>>> soup = BeautifulSoup(file)
>>> results = soup('p','g')
>>> x = results[1].a.renderContents()
>>> type(x)
<type 'unicode'>
>>> print x
Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping <b>...</b>

So far so good.  But what I really want is just the text, so I try
something like:

>>> y = results[1].a.fetchText(re.compile('.+'))
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "BeautifulSoup.py", line 466, in fetchText
    return self.fetch(recursive=recursive, text=text, limit=limit)
  File "BeautifulSoup.py", line 492, in fetch
    return self._fetch(name, attrs, text, limit, generator)
  File "BeautifulSoup.py", line 194, in _fetch
    if self._matches(i, text):
  File "BeautifulSoup.py", line 252, in _matches
    chunk = str(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in
position 26: ordinal not in range(128)

Is this a bug?  Come to think of it, I'm not even sure how printing x
worked, since it printed non-ascii characters.
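As far as I can tell, the last line of the traceback can be reproduced without Beautiful Soup at all. The sketch below is an assumption about what is going on, not a reading of the library's code: in Python 2, calling str() on a unicode object encodes it with the ASCII codec, so I do that encoding explicitly here (which also lets the snippet run on Python 3). The variable names and the shortened title string are mine.

```python
# Shortened copy of the link text; u'\xbb' is the » character.
title = u"Matt Croydon::Postneo 2.0 \xbb Blog Archive"

# Python 2's str(unicode_obj) behaves like unicode_obj.encode("ascii").
# Doing it explicitly triggers the same error as in the traceback.
try:
    title.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Index of the first character ASCII cannot represent.
print(error.start)  # 26

# Naming a codec that covers the character avoids the error.
utf8_bytes = title.encode("utf-8")
```

The reported position matches the traceback, which suggests the error comes from the implicit ASCII conversion rather than from the regular expression. Printing x probably worked because print encodes with the console's encoding instead of ASCII, though that is a guess on my part.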

If I convert to a string first:
>>> filestr = file.encode("utf-8")
>>> soup = BeautifulSoup(filestr)
>>> soup('p','g')[1].font.fetchText(re.compile('.+'))
['Mobile Screen Scraping with ', 'BeautifulSoup', ' and Python for
Series 60. ', 'BeautifulSoup', ' 2', 'BeautifulSoup', ' 3. I
haven\xe2&euro;&trade;t had enough time to work up a proper hack for
', '...', 'www.postneo.com/2005/03/28/',
'mobile-screen-scraping-with-', 'beautifulsoup',
'-and-python-for-series-60 -  19k -  Aug 24, 2005 - ', ' ', 'Cached',
' - ', 'Similar&nbsp;pages']

The regex works, but things like "I haven\xe2&euro;&trade;t" get a bit
mangled :)  In filestr, it was represented as  haven\xe2\x80\x99t
which I guess is how the UTF-8 bytes get displayed as ASCII escapes.
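To check what those three bytes are, here is a small sketch using only the standard codecs. It decodes the bytes once as UTF-8 and once as Windows-1252 (cp1252). That the parser treats the bytes as Windows-1252 before turning them into entities is only my assumption, based on the output above; the variable names are mine.

```python
# The bytes as they appeared in filestr.
raw = b"haven\xe2\x80\x99t"

# Decoded as UTF-8, the three bytes \xe2\x80\x99 form one character:
# U+2019, the right single quotation mark.
text = raw.decode("utf-8")

# Decoded one byte at a time as Windows-1252, the same bytes become
# three characters: \xe2, the euro sign (U+20AC), and the trademark
# sign (U+2122).
misread = raw.decode("cp1252")

print(len(text), len(misread))  # 7 9
```

The euro and trademark signs in the second decoding line up with the &euro; and &trade; entities in the mangled output, so the byte string seems to be getting interpreted one byte at a time somewhere along the way.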

