[Tutor] Beautiful Soup / Unicode problem?

Fri Aug 26 18:53:34 CEST 2005

> Here you go:
>
> >>> import types
> >>> print types.StringTypes
> (<type 'str'>, <type 'unicode'>)
> >>> import sys
> >>> print sys.version
> 2.3.4 (#2, May 29 2004, 03:31:27)
> [GCC 3.3.3 (Debian 20040417)]
> >>> print type(u'hello') in types.StringTypes
> True
> >>> sys.getdefaultencoding()
> 'ascii'

[CCing Leonard Richardson: we found a bug and a correction to the code.
See below.]

Ok, this is officially a mystery.  *grin*  Let me try some tests too.

######
>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup(u"<html>\xbb</html>")
>>> import re
>>> result = soup.fetchText(re.compile('.*'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "BeautifulSoup.py", line 465, in fetchText
    return self.fetch(recursive=recursive, text=text, limit=limit)
  File "BeautifulSoup.py", line 491, in fetch
    return self._fetch(name, attrs, text, limit, generator)
  File "BeautifulSoup.py", line 193, in _fetch
    if self._matches(i, text):
  File "BeautifulSoup.py", line 251, in _matches
    chunk = str(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in
position 0: ordinal not in range(128)
######

Gaaa!  Ok, that's not right.  Well, at least I'm seeing the same results
as you.  *grin* This seems like a bug in BeautifulSoup; let me look at the
flow of values again... ah! I see.  That was silly.

The problem is that 'chunk' can be a NavigableString or a
NavigableUnicodeString, and neither of those types is in
types.StringTypes, even though they subclass str and unicode.  So the
bit of code here:

        if not type(chunk) in types.StringTypes:

never worked properly.  *grin*

A possible fix to this is to change the check for direct types into a
check for subclass or isinstance; we can change the line at
BeautifulSoup.py:250 from:

        if not type(chunk) in types.StringTypes:

to:

        if not isinstance(chunk, basestring):

Testing the change now...

######
>>> soup = BeautifulSoup.BeautifulSoup(u"<html>\xbb</html>")
>>> result = soup.fetchText(re.compile('.*'))
>>> result
[u'\xbb']
######

Ah, better.  *grin*
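
For the record, the difference between the two checks can be reproduced
without BeautifulSoup at all.  This is only a minimal sketch: MarkedString
is a made-up stand-in for the library's string subclasses, and I'm assuming
a modern Python where plain 'str' plays the role that 'basestring' plays in
the 2.x line.

```python
# Hypothetical stand-in for a string subclass such as NavigableString;
# this class is not part of BeautifulSoup.
class MarkedString(str):
    pass


chunk = MarkedString("hello")

# Exact-type membership: type(chunk) is MarkedString, which is not
# literally one of the listed types, so the test is False.
exact_match = type(chunk) in (str,)

# isinstance() follows the inheritance chain, so a subclass passes.
subclass_match = isinstance(chunk, str)

print(exact_match)
print(subclass_match)
```

The first check only succeeds for objects whose type is exactly one of the
listed types, which is why the subclass slipped through and got passed to
str().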

One other problem is the implementation of __repr__(); I know it's
convenient for it to delegate to str(), but that poses a problem:

######
>>> soup = BeautifulSoup.BeautifulSoup(u"<html>\xbb</html>")
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "BeautifulSoup.py", line 374, in __repr__
    return str(self)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in
position 6: ordinal not in range(128)
######

repr() should never fail like this, regardless of our default encoding.
The cheap way out might be to just not implement repr(), but that's
probably not so nice.  *grin* I'd have to look at the implementation of
__str__() some more and see if there's a good general way to fix this.
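
One direction that might work, and I haven't tried it against the real
code, is to have __repr__() escape non-ASCII characters instead of
encoding with the default codec.  The sketch below uses a made-up Page
class with a 'markup' attribute (both are my assumptions, not
BeautifulSoup's actual structure) and the standard "backslashreplace"
error handler:

```python
class Page(object):
    """Hypothetical stand-in for a parsed document; not BeautifulSoup's class."""

    def __init__(self, markup):
        self.markup = markup

    def __repr__(self):
        # Replace every non-ASCII character with a backslash escape
        # rather than raising an encode error.
        escaped = self.markup.encode("ascii", "backslashreplace")
        return escaped.decode("ascii")


page = Page(u"<html>\xbb</html>")
shown = repr(page)
print(shown)
```

The result is pure ASCII, so it no longer depends on what
sys.getdefaultencoding() returns.  The tradeoff is that the escaped text is
not the original markup, so this would only suit __repr__(), not
__str__().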

Best of wishes!