[Tutor] Beautiful Soup / Unicode problem?
grouch at gmail.com
Thu Aug 25 20:24:38 CEST 2005
I'm having bang-my-head-against-a-wall moments trying to figure all of this out.
A word of warning: this is the first time I've tried using Unicode or
Beautiful Soup, so if I'm being stupid, please forgive me. I'm trying
to scrape results from Google with Beautiful Soup as a test case.
I've seen people recommend it here, so maybe somebody can recognize
what I'm doing wrong:
>>> import re, urllib
>>> from BeautifulSoup import BeautifulSoup
>>> file = urllib.urlopen("http://www.google.com/search?q=beautifulsoup")
>>> file = file.read().decode("utf-8")
>>> soup = BeautifulSoup(file)
>>> results = soup('p','g')
>>> x = results.a.renderContents()
>>> print x
Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping <b>...</b>
So far so good. But what I really want is just the text, so I try
>>> y = results.a.fetchText(re.compile('.+'))
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "BeautifulSoup.py", line 466, in fetchText
    return self.fetch(recursive=recursive, text=text, limit=limit)
  File "BeautifulSoup.py", line 492, in fetch
    return self._fetch(name, attrs, text, limit, generator)
  File "BeautifulSoup.py", line 194, in _fetch
    if self._matches(i, text):
  File "BeautifulSoup.py", line 252, in _matches
    chunk = str(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 26: ordinal not in range(128)
Is this a bug? Come to think of it, I'm not even sure how printing x
worked, since it printed non-ascii characters.
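
For reference, here is a minimal sketch of what the traceback seems to describe. It is written for Python 3 so it runs on a current interpreter, which is an assumption; the session above is Python 2, where str() on a unicode object encodes it with the ASCII codec implicitly. The sample text and the helper name to_ascii are made up for illustration and are not part of Beautiful Soup.

```python
# Illustrative text containing the same character as in the traceback (u'\xbb').
text = "Postneo 2.0 \u00bb Blog Archive"


def to_ascii(s):
    """Try an explicit ASCII encode and report where it fails (hypothetical helper)."""
    try:
        return s.encode("ascii")
    except UnicodeEncodeError as exc:
        return "failed: %s at position %d" % (exc.reason, exc.start)


# ASCII cannot represent U+00BB, so the encode fails at that character.
result = to_ascii(text)

# Encoding with UTF-8 instead succeeds; U+00BB becomes the two bytes C2 BB.
utf8_bytes = text.encode("utf-8")

print(result)
print(utf8_bytes)
```

If this reading is right, the error comes from the implicit ASCII conversion rather than from the regular expression itself.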
If I convert to a string first:
>>> filestr = file.encode("utf-8")
>>> soup = BeautifulSoup(filestr)
['Mobile Screen Scraping with ', 'BeautifulSoup', ' and Python for
Series 60. ', 'BeautifulSoup', ' 2', 'BeautifulSoup', ' 3. I
haven\xe2€™t had enough time to work up a proper hack for
', '...', 'www.postneo.com/2005/03/28/',
'-and-python-for-series-60 - 19k - Aug 24, 2005 - ', ' ', 'Cached',
' - ', 'Similar pages']
The regex works, but things like "I haven\xe2€™t" get a bit
mangled :) In filestr, it was represented as haven\xe2\x80\x99t
which I guess is the ASCII representation for UTF-8.
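
A small sketch of where the mangled apostrophe could come from, again in Python 3 for runnability (an assumption; the session above is Python 2). It assumes the character in the page was U+2019, the right single quotation mark, and that the console was interpreting bytes as Windows-1252. Both are guesses based on the byte values shown, not something confirmed above.

```python
# Assumed original word, with a curly apostrophe (U+2019).
word = "haven\u2019t"

# UTF-8 encodes U+2019 as the three bytes E2 80 99.
utf8_bytes = word.encode("utf-8")

# Reading those same bytes as Windows-1252 gives three separate characters:
# E2 -> a-circumflex, 80 -> euro sign, 99 -> trademark sign.
mangled = utf8_bytes.decode("cp1252")

# Decoding with the right codec restores the original text.
restored = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(mangled)
print(restored == word)
```

Under these assumptions, haven\xe2\x80\x99t is the escaped display of UTF-8 bytes rather than ASCII, and the euro and trademark signs appear when the last two bytes are shown in a Windows-1252 console.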