[Tutor] Problem handling utf-8 text
Ryan Ginstrom
ryang at gol.com
Fri Mar 10 11:32:19 CET 2006
I am just learning Python, or trying to, and am having trouble handling UTF-8
text.
I want to take a UTF-8 encoded web page and feed it to Beautiful Soup
(http://crummy.com/software/BeautifulSoup/).
BeautifulSoup uses SGMLParser to parse text.
Although I can read the UTF-8 encoded Japanese text from the web page and
write it to a file without corruption, the text coming out of Beautiful Soup
is mangled. I imagine this is because the parser thinks I'm giving it a
string in the system encoding, which is Shift-JIS (sjis).
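To illustrate the kind of mismatch suspected here, the following is a minimal sketch, separate from the test program. The sample string is made up for the demonstration and is not taken from the page, and it only shows what happens when UTF-8 bytes are interpreted as Shift-JIS; it does not confirm that this is what the parser actually does.

```python
# -*- coding: utf-8 -*-
# Illustration only: the sample text is an assumption, not data from the page.
text = u"\u65e5\u672c\u8a9e"        # three Japanese characters as a unicode string
utf8_bytes = text.encode("utf-8")   # the byte form a UTF-8 page would be served in

# Decoding with the matching codec gives back the original text.
good = utf8_bytes.decode("utf-8")

# Decoding the same bytes as Shift-JIS yields different characters.
# "replace" stops the demo from raising on byte sequences that are
# invalid in Shift-JIS.
bad = utf8_bytes.decode("shift_jis", "replace")

print(good == text)
print(bad == text)
```

If the garbled output in parsed.txt resembles what the wrong-codec decode produces, that would be consistent with the guess above.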
Here is the code I am using:
# -*- coding: utf-8 -*-
# ==============================
# Test program to read in utf-8 encoded html page
# ==============================
import urllib2, pprint
from BeautifulSoup import BeautifulSoup
# utf-8 encoded content
html = urllib2.urlopen( 'http://jat.org/jtt/index.html' ).read()
# write the raw html to raw.txt
# This comes out ok
file1 = open("raw.txt", "w")
print >> file1, html
file1.close()
# write the parsed html to parsed.txt
# The Japanese text is garbled in this one
file2 = open("parsed.txt", "w")
soup = BeautifulSoup()
soup.feed( html )
print >> file2, soup.html
file2.close()
# ==============================
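For comparison, here is a minimal sketch of making the encoding explicit at both ends, so that nothing is left to the system default. The helper names decode_html and encode_for_file are hypothetical, the sample bytes stand in for the result of urllib2.urlopen(...).read(), and whether Beautiful Soup's feed() handles a unicode string correctly is an assumption that would need checking.

```python
# -*- coding: utf-8 -*-
# Sketch only: decode_html and encode_for_file are hypothetical helpers,
# and the sample bytes are a stand-in for the downloaded page.

def decode_html(raw_bytes, encoding="utf-8"):
    """Turn raw page bytes into a unicode string using an explicit codec."""
    return raw_bytes.decode(encoding)

def encode_for_file(text, encoding="utf-8"):
    """Encode a unicode string back to bytes before writing it to a file."""
    return text.encode(encoding)

raw = u"<html><body>\u65e5\u672c\u8a9e</body></html>".encode("utf-8")
page = decode_html(raw)            # unicode from here on
out_bytes = encode_for_file(page)  # bytes again, suitable for file.write()
```

The idea is that the text is decoded once on the way in and encoded once on the way out, with the codec named each time.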
Any help would be much appreciated.
Regards,
Ryan
---
Ryan Ginstrom
ryang at gol.com / translation at ginstrom.com
http://ginstrom.com