[Tutor] Beautiful Soup / Unicode problem?

Fri Aug 26 00:41:06 CEST 2005

On Thu, 25 Aug 2005, grouchy wrote:

> >>>file = urllib.urlopen("http://www.google.com/search?q=beautifulsoup")
> >>>file = file.read().decode("utf-8")
> >>>soup = BeautifulSoup(file)
> >>>results = soup('p','g')
> >>> x = results[1].a.renderContents()
> >>> type(x)
> <type 'unicode'>
> >>> print x
> Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping <b>...</b>

Hi Grouchy,

So far, so good.  You were lucky to be able to print 'x' off-hand like
that.  When we str() a unicode string, Python will use the default
encoding scheme:

######
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
######

On my Linux machine, I'm ok as long as the unicode string doesn't include
anything that can't be encoded as ascii.

Of course, the flip side of this is that some unicode strings can't be
str()'ed right off the bat:

######
>>> nonasciiMsg = unicode(u'\xbb')
>>> nonasciiMsg
u'\xbb'
>>> str(nonasciiMsg)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in
position 0: ordinal not in range(128)
######

Note that we can still get a kind of string representation of nonasciiMsg
through its repr(), as the interpreter showed above; it's when we use
str() that bad things happen.  The print statement can hit the same wall,
because it falls back on str() whenever it doesn't know the encoding of
the output stream.
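
One way to avoid depending on the default encoding is to name the codec
yourself.  The following is only a minimal sketch: the variable names are
made up for illustration, and 'utf-8' is an assumption about what the
receiving end expects.

```python
# Sketch only: names here are illustrative, not from the original post.
nonascii_msg = u'\xbb'   # same character as in the traceback above

# Encoding with the ASCII codec fails for this character...
try:
    nonascii_msg.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ...but naming a codec that covers the character succeeds.
utf8_bytes = nonascii_msg.encode('utf-8')

# The 'replace' error handler substitutes '?' for anything the
# codec cannot represent, at the cost of losing the character.
lossy_bytes = nonascii_msg.encode('ascii', 'replace')
```

Which codec to pick depends on where the bytes are going (a terminal, a
file, a web page), so treat 'utf-8' above as a placeholder.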

> So far so good.  But what I really want is just the text, so I try
> something like:
>
> >>> y = results[1].a.fetchText(re.compile('.+'))
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
>   File "BeautifulSoup.py", line 466, in fetchText
>     return self.fetch(recursive=recursive, text=text, limit=limit)
>   File "BeautifulSoup.py", line 492, in fetch
>     return self._fetch(name, attrs, text, limit, generator)
>   File "BeautifulSoup.py", line 194, in _fetch
>     if self._matches(i, text):
>   File "BeautifulSoup.py", line 252, in _matches
>     chunk = str(chunk)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in
> position 26: ordinal not in range(128)

That's odd!  Ok, let's check why BeautifulSoup is str()-ing the chunk...

### in BeautifulSoup.py #######################
        #Now we know that chunk is a string
        if not type(chunk) in types.StringTypes:
            chunk = str(chunk)
        if hasattr(howToMatch, 'match'):
            # It's a regexp object.
            return howToMatch.search(chunk)
################################################

Ok, that's surprising!  Isn't a unicode string's type in
types.StringTypes?

######
>>> import types
>>> types.StringTypes
(<type 'str'>, <type 'unicode'>)
######

Ok.  That, too, looks fine.  The traceback says we reach line 252, the
str() call, with 'chunk' being a unicode string.  But judging from the
experiments on my system, running Python 2.3.5, a unicode chunk should
have skipped that branch, so I don't see how we end up there.
Mysterious.
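
To make the puzzle concrete, here is a small sketch of the same kind of
guard, pulled out of BeautifulSoup.  The names coerce_chunk and
STRING_TYPES are hypothetical stand-ins, not part of the library; the
tuple is meant to mirror types.StringTypes.  If the guard behaves the way
it does on my system, a unicode chunk passes through untouched and never
reaches str().

```python
# Hypothetical stand-in for the check in _matches; not BeautifulSoup's code.
try:
    STRING_TYPES = (str, unicode)   # same members as types.StringTypes
except NameError:
    STRING_TYPES = (str,)           # interpreters with no separate unicode type

def coerce_chunk(chunk):
    """Return chunk unchanged if it is already a string, else str() it."""
    if type(chunk) not in STRING_TYPES:
        chunk = str(chunk)
    return chunk

unicode_chunk = u'Blog Archive \xbb Mobile'

# A unicode chunk should come back as the very same object...
same_object = coerce_chunk(unicode_chunk) is unicode_chunk

# ...while a non-string value goes through str().
number_as_text = coerce_chunk(42)
```

If same_object came out false on your machine, that would point at the
type check rather than at the regular expression.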

If you have a moment, do you mind doing this on your system?

######
import types
print types.StringTypes
import sys
print sys.version
print type(u'hello') in types.StringTypes
######

and show us what comes up?

Good luck to you!