[Tutor] Problem handling utf-8 text
Kent Johnson
kent37 at tds.net
Fri Mar 10 14:26:53 CET 2006
Ryan Ginstrom wrote:
> I am just learning python, or trying to, and am having trouble handling utf-8
> text.
>
> I want to take a utf-8 encoded web page, and feed it to Beautiful Soup
> (http://crummy.com/software/BeautifulSoup/).
> BeautifulSoup uses SGMLParser to parse text.
>
> But although I am able to read the utf-8 encoded Japanese text from the web
> page and print it to a file without corruption, the text coming out of
> Beautiful Soup is mangled. I imagine it's because the parser thinks I'm
> giving it a string in the system encoding, which is sjis.
You're not the first person to have trouble with BS and non-ascii text,
unfortunately.
I wrote a program to test round-tripping data through BS. It turns out
that BS is being 'helpful' and converting bytes in the range 0x80 to
0x9F to the equivalent cp1252 entity escapes. This might be useful if the
source text really is cp1252, but it is disastrous for utf-8, as you have
discovered.
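To see why that byte range is fatal to utf-8, here is a small modern
(Python 3) sketch of my own, separate from the Python 2 code in this
post: the continuation bytes of multibyte utf-8 sequences routinely fall
in 0x80-0x9F, so rewriting them breaks the encoding.

```python
# Python 3 illustration (not from the original post) of why rewriting
# bytes 0x80-0x9F corrupts utf-8 text.
data = "日本語".encode("utf-8")  # b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

# Bytes a cp1252-style fixup would rewrite as entity escapes:
suspect = [hex(b) for b in data if 0x80 <= b <= 0x9F]
print(suspect)  # ['0x97', '0x9c', '0x9e'] -- one per character here

# Simulate the damage: replace those bytes, then try to decode.
mangled = bytes(b if not 0x80 <= b <= 0x9F else ord("?") for b in data)
try:
    mangled.decode("utf-8")
except UnicodeDecodeError:
    print("mangled bytes are no longer valid utf-8")
```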
A solution is to turn off this fixup (and a few others) by passing
avoidParserProblems=False to the BeautifulSoup constructor.
Here is a short program that successfully round-trips a selection of
utf-8 chars:
from BeautifulSoup import BeautifulSoup

# Test data includes all codepoints from 32-255, encoded as utf-8
data = ''.join(chr(n) for n in range(32, 256))
data = unicode(data, 'latin-1').encode('utf-8')

html = '<body>' + data + '</body>'
# avoidParserProblems=False turns off the cp1252 entity fixup
soup = BeautifulSoup(html, avoidParserProblems=False)
newData = soup.body.string

print repr(data)
print
print repr(newData)
assert data == newData
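The original symptom (utf-8 bytes being interpreted in the sjis system
encoding) can also be reproduced directly. A Python 3 sketch of my own,
not part of the original program:

```python
# Decoding utf-8 bytes as shift_jis silently produces mojibake rather
# than the original Japanese text, because many utf-8 byte pairs happen
# to be valid shift_jis sequences.
original = "日本語"
utf8_bytes = original.encode("utf-8")

mojibake = utf8_bytes.decode("shift_jis", errors="replace")
print(mojibake != original)  # True: the text is silently corrupted
```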
Kent