HTMLParser can't read Japanese

Dodo dodo_do_not_wake_up at yahoo.Fr
Tue Apr 13 07:45:01 EDT 2010


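Alright, it's just because of the Windows cmd console: print() fails there
because the console encoding (cp1252, per the traceback) can't represent
the Japanese characters. In IDLE it works fine.

Any workaround?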

Dorian
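
One possible workaround, as a rough sketch rather than a definitive fix:
escape whatever the console encoding cannot represent before printing.
console_safe is a hypothetical helper name, not part of html.parser, and
the fallback to "ascii" when stdout reports no encoding is an assumption.

```python
import sys

def console_safe(text, encoding=None):
    # Replace characters the target encoding cannot represent with
    # backslash escapes instead of raising UnicodeEncodeError.
    # Assumption: fall back to ASCII if stdout reports no encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

# Same kind of line the handler prints, with a Japanese attribute value.
attrs = [("href", "#"), ("title", "キャプテン・ムラサ")]
print(console_safe("Start of %s tag : %s" % ("a", attrs)))
```

In the script, the print() call inside handle_starttag would become
print(console_safe(...)). Setting the PYTHONIOENCODING environment variable
(for example to cp1252:backslashreplace) before starting Python may give a
similar effect without touching the code, but that is untested here.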

On 13/04/2010 13:40, Dodo wrote:
> Here's a small script that reproduces the error,
> running Windows 7 with Python 3.1
>
> FILE : parseShift.py
>
> import urllib.request as url
> from html.parser import HTMLParser
>
> class myParser(HTMLParser):
>     def handle_starttag(self, tag, attrs):
>         print("Start of %s tag : %s" % (tag, attrs))
>
>
> test = myParser()
> handle = url.urlretrieve("http://localhost/shift.html")
> handleTemp = open( handle[0] , encoding="Shift-JIS" )
> test.feed( handleTemp.read() )
> handleTemp.close()
>
> FILE : shift.html (encoded Shift-JIS)
>
> <p class="thisisclass (not_in_japanese) reading_this_should_be_ok">Some
> random japanese
> <p><strong>東方プロジェクト</strong> <a href="#" title="キャプテン・ムラサ">Link</a>
>
> OUTPUT
>
> Start of p tag : [('class', 'thisisclass (not_in_japanese) reading_this_should_be_ok')]
> Start of p tag : []
> Start of strong tag : []
> Traceback (most recent call last):
>   File "D:\Dorian\Python\parseShift.py", line 12, in <module>
>     test.feed( handleTemp.read() )
>   File "C:\Python31\lib\html\parser.py", line 108, in feed
>     self.goahead(0)
>   File "C:\Python31\lib\html\parser.py", line 148, in goahead
>     k = self.parse_starttag(i)
>   File "C:\Python31\lib\html\parser.py", line 268, in parse_starttag
>     self.handle_starttag(tag, attrs)
>   File "D:\Dorian\Python\parseShift.py", line 6, in handle_starttag
>     print("Start of %s tag : %s" % (tag, attrs))
>   File "C:\Python31\lib\encodings\cp1252.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
> UnicodeEncodeError: 'charmap' codec can't encode characters in position 44-52: character maps to <undefined>
>
>
> Any help?
> Dorian



