[Tutor] Problem handling utf-8 text

Kent Johnson kent37 at tds.net
Fri Mar 10 14:26:53 CET 2006


Ryan Ginstrom wrote:
> I am just learning python, or trying to, and am having trouble handling utf-8
> text.
> 
> I want to take a utf-8 encoded web page, and feed it to Beautiful Soup
> (http://crummy.com/software/BeautifulSoup/).
> BeautifulSoup uses SGMLParser to parse text.
> 
> But although I am able to read the utf-8 encoded Japanese text from the web
> page and print it to a file without corruption, the text coming out of
> Beautiful Soup is mangled. I imagine it's because the parser thinks I'm
> giving it a string in the system encoding, which is sjis.

You're not the first person to have trouble with BS and non-ASCII text, 
unfortunately.

I wrote a program to test round-tripping data through BS. It turns out 
that BS is being 'helpful' and converting the bytes in the range 0x80 to 
0x9F to equivalent entity escapes. This might be useful if the source 
text really is cp1252, but it is disastrous for utf-8, as you have 
discovered.
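To see why that fixup mangles utf-8: multi-byte utf-8 sequences routinely 
contain continuation bytes in exactly the 0x80-0x9F range, so escaping 
them tears the sequence apart. A small sketch (the hiragana character is 
just an arbitrary example):

```python
# -*- coding: utf-8 -*-
# Why entity-escaping bytes 0x80-0x9F corrupts utf-8: non-ASCII
# characters encode to multi-byte sequences, and the continuation
# bytes frequently fall inside that very range.
ch = u'\u3042'                  # HIRAGANA LETTER A, as an example
encoded = ch.encode('utf-8')    # the three bytes 0xE3 0x81 0x82
# 0x81 and 0x82 are continuation bytes in the 0x80-0x9F range;
# replacing either one with a cp1252-style entity escape leaves a
# byte sequence that no longer decodes as utf-8.
```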

A solution is to turn off this fixup (and a few others) by passing 
avoidParserProblems=False to the BeautifulSoup constructor.

Here is a short program that successfully round-trips a selection of 
utf-8 chars:

from BeautifulSoup import BeautifulSoup

# Test data: every byte value from 32 through 255, decoded as
# latin-1 and re-encoded as utf-8
data = ''.join(chr(n) for n in range(32, 256))
data = unicode(data, 'latin-1').encode('utf-8')

html = '<body>' + data + '</body>'
soup = BeautifulSoup(html, avoidParserProblems=False)

newData = soup.body.string
print repr(data)
print
print repr(newData)

assert data == newData


Kent
