[Tutor] Problem handling utf-8 text

Ryan Ginstrom ryang at gol.com
Fri Mar 10 11:32:19 CET 2006


I am just learning Python, or trying to, and am having trouble handling
UTF-8 text.

I want to take a UTF-8 encoded web page and feed it to Beautiful Soup
(http://crummy.com/software/BeautifulSoup/).
Beautiful Soup uses SGMLParser to parse text.

Although I am able to read the UTF-8 encoded Japanese text from the web
page and print it to a file without corruption, the text coming out of
Beautiful Soup is mangled. I imagine it's because the parser thinks I'm
giving it a string in the system encoding, which is Shift_JIS (sjis).
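
The guess above can be illustrated in isolation. This is only a minimal
sketch of what happens when UTF-8 bytes are interpreted as Shift_JIS; it
does not involve Beautiful Soup at all, and the sample string is an
assumption chosen for illustration, not text from the actual page.

```python
# Sample Japanese text ("nihongo"), written with escapes so the source
# file's own encoding does not matter. This string is illustrative only.
text = u"\u65e5\u672c\u8a9e"

# The bytes a UTF-8 encoded web page would contain for that text.
utf8_bytes = text.encode("utf-8")

# Decoding with the correct codec recovers the original text.
decoded_ok = utf8_bytes.decode("utf-8")

# Decoding the same bytes as Shift_JIS yields different characters;
# errors="replace" substitutes U+FFFD for byte sequences that are not
# valid Shift_JIS instead of raising UnicodeDecodeError.
decoded_wrong = utf8_bytes.decode("shift_jis", errors="replace")

print(decoded_ok == text)     # the correct codec round-trips
print(decoded_wrong == text)  # the wrong codec does not
```

If something along the way treats the byte string as Shift_JIS, this is
the kind of damage I would expect to see, but I have not confirmed that
this is what the parser is doing.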

Here is the code I am using:

# -*- coding: utf-8 -*-

# ==============================
# Test program to read in utf-8 encoded html page
# ==============================

import urllib2, pprint
from BeautifulSoup import BeautifulSoup

# utf-8 encoded content
html = urllib2.urlopen( 'http://jat.org/jtt/index.html' ).read()

# write the raw html to raw.txt
# This comes out ok
file1 = open("raw.txt", "w")
print >> file1, html
file1.close()

# write the parsed html to parsed.txt
# The Japanese text is garbled in this one
file2 = open("parsed.txt", "w")
soup = BeautifulSoup()
soup.feed( html )
print >> file2, soup.html
file2.close()

# ==============================
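
One approach that might sidestep the system encoding is to decode the
bytes to a unicode string explicitly right after reading them, and to
encode explicitly when writing. The following is only a sketch of that
idea: decode_page and write_text are hypothetical helper names, the HTML
is a stand-in for what urlopen().read() returns, and whether this
actually fixes the Beautiful Soup output depends on how the parser
treats its input, which I have not verified.

```python
import io
import os
import tempfile


def decode_page(raw_bytes, encoding="utf-8"):
    """Turn the raw bytes of a page into a unicode string (hypothetical helper)."""
    return raw_bytes.decode(encoding)


def write_text(path, text, encoding="utf-8"):
    """Write a unicode string to a file with an explicit encoding (hypothetical helper)."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


# Stand-in for the bytes returned by urlopen(...).read(); illustrative only.
raw = u"<html><body>\u65e5\u672c\u8a9e</body></html>".encode("utf-8")

# Decode once, at the boundary, so later steps never have to guess.
page = decode_page(raw)

# Write the text back out as UTF-8 and read the raw bytes to compare.
out_path = os.path.join(tempfile.mkdtemp(), "parsed.txt")
write_text(out_path, page)
with io.open(out_path, "rb") as f:
    round_trip = f.read()

print(round_trip == raw)
```

The point of the sketch is that no step relies on a default encoding:
the codec is named both when the bytes come in and when the text goes
out.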

Any help much appreciated.

Regards,
Ryan

---
Ryan Ginstrom
ryang at gol.com / translation at ginstrom.com 
http://ginstrom.com 


