[Tutor] Beautiful Soup / Unicode problem?
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Fri Aug 26 18:53:34 CEST 2005
> Here you go:
>
> >>> import types
> >>> print types.StringTypes
> (<type 'str'>, <type 'unicode'>)
> >>> import sys
> >>> print sys.version
> 2.3.4 (#2, May 29 2004, 03:31:27)
> [GCC 3.3.3 (Debian 20040417)]
> >>> print type(u'hello') in types.StringTypes
> True
> >>> sys.getdefaultencoding()
> 'ascii'
[CCing Leonard Richardson: we found a bug and a correction to the code.
See below.]
Ok, this is officially a mystery. *grin* Let me try some tests too.
######
>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup(u"<html>\xbb</html>")
>>> import re
>>> result = soup.fetchText(re.compile('.*'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "BeautifulSoup.py", line 465, in fetchText
    return self.fetch(recursive=recursive, text=text, limit=limit)
  File "BeautifulSoup.py", line 491, in fetch
    return self._fetch(name, attrs, text, limit, generator)
  File "BeautifulSoup.py", line 193, in _fetch
    if self._matches(i, text):
  File "BeautifulSoup.py", line 251, in _matches
    chunk = str(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 0: ordinal not in range(128)
######
Gaaa! Ok, that's not right. Well, at least I'm seeing the same results
as you. *grin* This seems like a bug in BeautifulSoup; let me look at the
flow of values again... ah! I see. That was silly.
The problem is that 'chunk' can be a NavigableString or a
NavigableUnicodeString. Those are subclasses of str and unicode, so
type(chunk) is never exactly one of the types in types.StringTypes.
So the bit of code here:
    if not type(chunk) in types.StringTypes:
never worked properly. *grin*
A possible fix is to replace the exact-type check with an isinstance
check, which also accepts subclasses; we can change the line in
BeautifulSoup.py:250 from:
if not type(chunk) in types.StringTypes:
to:
if not isinstance(chunk, basestring):
Testing the change now...
######
>>> soup = BeautifulSoup.BeautifulSoup(u"<html>\xbb</html>")
>>> result = soup.fetchText(re.compile('.*'))
>>> result
[u'\xbb']
######
Ah, better. *grin*
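The difference between the two checks can be seen in isolation. This is only a rough sketch: NavigableText below is a hypothetical stand-in class, not BeautifulSoup's real one, and the tuple string_types stands in for types.StringTypes (or basestring) so that the snippet also runs on interpreters that lack those names.

```python
# Hypothetical stand-in for a string subclass such as NavigableString.
class NavigableText(str):
    pass

# Assumption: this tuple plays the role of types.StringTypes / basestring.
string_types = (str,)

chunk = NavigableText("hello")

# Exact-type check: type(chunk) is NavigableText, which is not in the tuple.
exact_match = type(chunk) in string_types

# isinstance check: also accepts subclasses of the listed types.
subclass_match = isinstance(chunk, string_types)

print(exact_match, subclass_match)
```

The exact-type test rejects the subclass instance while isinstance accepts it, which is why the original condition always fell through to str(chunk).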
One other problem is the implementation of __repr__(); I know it's
convenient for it to delegate to str(), but that poses a problem:
######
>>> soup = BeautifulSoup.BeautifulSoup(u"<html>\xbb</html>")
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "BeautifulSoup.py", line 374, in __repr__
    return str(self)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 6: ordinal not in range(128)
######
repr() should never fail like this, regardless of our default encoding.
The cheap way out might be to just not implement repr(), but that's
probably not so nice. *grin* I'd have to look at the implementation of
__str__() some more and see if there's a good general way to fix this.
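One possible direction, as a sketch only: have __repr__() encode with the 'backslashreplace' error handler, so non-ASCII characters come out as escapes instead of raising. The Doc class and its text attribute are hypothetical and are not how BeautifulSoup stores its data; whether this fits the real __str__() is something I haven't checked.

```python
# Hypothetical class for illustration; not BeautifulSoup's implementation.
class Doc(object):
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        # 'backslashreplace' turns characters that ASCII cannot encode
        # into backslash escapes, so the encode step cannot raise
        # UnicodeEncodeError here.
        encoded = self.text.encode('ascii', 'backslashreplace')
        return encoded.decode('ascii')

doc = Doc(u"<html>\xbb</html>")
shown = repr(doc)
print(shown)
```

The result is pure ASCII, so it no longer depends on the default encoding.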
Best of wishes!