HTMLParser can't read Japanese

Dodo dodo_do_not_wake_up at yahoo.Fr
Tue Apr 13 07:40:50 EDT 2010

Here's a small script that reproduces the error.
I'm running Windows 7 with Python 3.1.


import urllib.request as url
from html.parser import HTMLParser

class myParser(HTMLParser):
	def handle_starttag(self, tag, attrs):
		print("Start of %s tag : %s" % (tag, attrs))

test = myParser()		
handle = url.urlretrieve("http://localhost/shift.html")
handleTemp = open( handle[0] , encoding="Shift-JIS" )
test.feed( handleTemp.read() )

FILE : shift.html (encoded Shift-JIS)

<p class="thisisclass (not_in_japanese) reading_this_should_be_ok">Some 
random japanese
<p><strong>東方プロジェクト</strong> <a href="#" title="キャプテン・ムラ 


Start of p tag : [('class', 'thisisclass (not_in_japanese) 
Start of p tag : []
Start of strong tag : []
Traceback (most recent call last):
   File "D:\Dorian\Python\", line 12, in <module>
     test.feed( )
   File "C:\Python31\lib\html\", line 108, in feed
   File "C:\Python31\lib\html\", line 148, in goahead
     k = self.parse_starttag(i)
   File "C:\Python31\lib\html\", line 268, in parse_starttag
     self.handle_starttag(tag, attrs)
   File "D:\Dorian\Python\", line 6, in handle_starttag
     print("Start of %s tag : %s" % (tag, attrs))
   File "C:\Python31\lib\encodings\", line 19, in encode
     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 44-52: character maps to <undefined>

Any help?
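
Reading the traceback from the bottom up, the exception comes from the print() call inside handle_starttag, in a 'charmap' codec under lib\encodings, and not from the parser itself. That suggests the Shift-JIS text was parsed fine and only the console output encoding cannot represent the Japanese characters. Below is a minimal sketch of that situation and one possible workaround. It assumes a console code page of cp850 (the codec file name is cut off in the traceback, so this is a guess), and the names attrs, line and safe_out are made up for illustration.

```python
import io

# Hypothetical stand-in for what HTMLParser passes to handle_starttag.
attrs = [("title", "キャプテン")]
line = "Start of %s tag : %s" % ("a", attrs)

# Assumption: the console uses a charmap code page such as cp850.
# Encoding Japanese text with it fails the same way print() does.
try:
    line.encode("cp850")
    failed = False
except UnicodeEncodeError:
    failed = True

# Possible workaround: write through a text wrapper that escapes
# characters the target encoding cannot represent.
buffer = io.BytesIO()  # stands in for the console's byte stream
safe_out = io.TextIOWrapper(buffer, encoding="cp850",
                            errors="backslashreplace")
print(line, file=safe_out)
safe_out.flush()

# The Japanese characters now appear as \uXXXX escapes instead of
# raising UnicodeEncodeError.
escaped = buffer.getvalue().decode("cp850")
```

In a real script the same idea would probably mean wrapping sys.stdout.buffer instead of a BytesIO, or writing the results to a file opened with encoding="utf-8". Either way the escapes are only a display workaround; the parsed strings themselves stay intact.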
