[Tutor] Re: chinese in python23
M.-A. Lemburg
mal@lemburg.com
Mon Jun 30 12:41:31 2003
jyllyj wrote:
> Environment:
> Windows XP
> Python 2.3
>
> My default charset is Chinese GB2312.
> I found no gb2312 encoder/decoder in ./python23/lib/encodings/,
> so I got the GB2312 charset map from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT,
> ran /Python23/Tools/Scripts/gencodec.py to generate gb2312.py,
> and put gb2312.py into /python23/lib/encodings/.
> In IDLE 0.8:
>
> >>> import codecs
> >>> codecs.lookup('gb2312')
>
> (<bound method Codec.encode of <encodings.gb2312.Codec instance at 0x01A073F0>>, <bound method Codec.decode of <encodings.gb2312.Codec instance at 0x01A07FD0>>, <class encodings.gb2312.StreamReader at 0x010F04E0>, <class encodings.gb2312.StreamWriter at 0x010F04B0>)
>
> Looks fine!
>
> >>> text='???' #chinese char
> >>> text.decode('gb2312')
>
> Traceback (most recent call last):
>   File "<pyshell#28>", line 1, in ?
>     text.decode('gb2312')
>   File "C:\Python23\lib\encodings\gb2312.py", line 22, in decode
>     return codecs.charmap_decode(input,errors,decoding_map)
> UnicodeDecodeError: 'charmap' codec can't decode byte 0xbd in position 0: character maps to <undefined>
>
> What's missing?
The charmap codec can only map 8-bit encodings to Unicode (and
vice versa). GB2312 is given as a 16-bit encoding in the table
you quote, so a codec generated from it by gencodec.py has no
entry for a single byte such as 0xbd.
You should probably try one of the CJK codec packages
available for Python, e.g.
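To illustrate the point, here is a minimal sketch. It assumes a current Python 3, where a gb2312 codec ships with the standard library (this was not the case in Python 2.3). The one-entry map is hypothetical and only imitates the 16-bit keys found in GB2312.TXT; it is not the file gencodec.py produced.

```python
import codecs

# "中文" in GB2312 (EUC-CN form) is four bytes: two bytes per character.
data = "中文".encode("gb2312")

# Hypothetical one-entry map in the style of GB2312.TXT: the key is a
# 16-bit code (0x5650), not a single byte value.
sixteen_bit_map = {0x5650: 0x4E2D}

failed_byte = None
try:
    # charmap_decode looks up each input byte on its own, so it never
    # finds a key for the first byte of a two-byte character.
    codecs.charmap_decode(data, "strict", sixteen_bit_map)
except UnicodeDecodeError as exc:
    failed_byte = data[exc.start]

# A real multi-byte codec consumes the bytes in pairs.
text = data.decode("gb2312")
```

The failure is of the same kind as the one in the traceback above: the lookup is done per byte, while the table is keyed per character.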
"""
http://sourceforge.net/project/showfiles.php?group_id=46747
The CJKCodecs is a unified unicode codec set for Chinese, Japanese
and Korean encodings. It supports full features of unicode codec
specification and PEP293 error callbacks on Python 2.3.
Currently supported encodings and planned updates:
Authority 0.9 1.0 1.1 1.2
==============================================================================
China (PRC) gb2312 iso-2022-cn
gbk(cp936) iso-2022-cn-ext
gb18030
hz
Hong Kong hkscs
Japan shift-jis iso-2022-jp-2 euc-jisx0213 iso-2022-int-1
euc-jp shift-jisx0213 mac_japanese
cp932 iso-2022-jp-3
iso-2022-jp
iso-2022-jp-1
Korea (ROK) euc-kr (ksx1001:2002) mac_korean
cp949(uhc) unijohab
johab
iso-2022-kr
Korea (DPRK) euc-kp
Taiwan big5 iso-2022-cn
cp950 iso-2022-cn-ext
euc-tw
Unicode.org utf-8 utf-7
utf-16
"""
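Once such a codec set is available, the encodings are used by name like any other codec. The sketch below is an assumption-laden illustration: it runs on a current Python 3, where codecs with these names are part of the standard library, rather than on Python 2.3 with the package above installed. The sample string and the choice of encodings are mine.

```python
# Round-trip a short Chinese string through several of the Chinese
# encodings named in the table above.
sample = "中文"
encodings = ["gb2312", "gbk", "gb18030", "hz", "big5"]

round_trips = {}
for name in encodings:
    raw = sample.encode(name)            # str -> bytes in that encoding
    round_trips[name] = raw.decode(name) == sample
```

Each entry in round_trips records whether decoding the encoded bytes gave back the original string.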
--
Marc-Andre Lemburg
eGenix.com
Professional Python Software directly from the Source (#1, Jun 28 2003)
>>> Python/Zope Products & Consulting ... http://www.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/