[Tutor] Re: chinese in python23

M.-A. Lemburg mal@lemburg.com
Mon Jun 30 12:41:31 2003


jyllyj wrote:
> environment:
> window xp
> python23
> 
> i'm in default chinese gb2312 charset
> in ./python23/lib/encoding/ no found gb2312 encode/decode
> so i get gb2312 charset map from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
> exec /Python23/Tools/Scripts/gencodec.py get gb2312.py
> put gb2312.py into /python23/lib/encoding/
> in IDLE 0.8
> 
> >>> import codecs
> >>> codecs.lookup('gb2312')
> 
> (<bound method Codec.encode of <encodings.gb2312.Codec instance at 0x01A073F0>>, <bound method Codec.decode of <encodings.gb2312.Codec instance at 0x01A07FD0>>, <class encodings.gb2312.StreamReader at 0x010F04E0>, <class encodings.gb2312.StreamWriter at 0x010F04B0>)
> 
> look fine!
> 
> >>> text='???' #chinese char
> >>> text.decode('gb2312')
> 
> Traceback (most recent call last):
>   File "<pyshell#28>", line 1, in ?
>     text.decode('gb2312')
>   File "C:\Python23\lib\encodings\gb2312.py", line 22, in decode
>     return codecs.charmap_decode(input,errors,decoding_map)
> UnicodeDecodeError: 'charmap' codec can't decode byte 0xbd in position 0: character maps to <undefined>
> 
> what's missing?

The charmap codec can only map single-byte (8-bit) encodings to
Unicode and vice versa. GB2312 is a double-byte encoding: the table
you quote lists it as 16-bit codes, so a codec generated by
gencodec.py cannot decode it.
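
To illustrate the difference, here is a minimal sketch. It assumes a
current Python 3, where the CJK codecs (including gb2312) ship with the
standard library; they were merged in Python 2.4, so this will not run
on 2.3 as is. The decoding_map below is a made-up ASCII-only table for
illustration, not the one gencodec.py produced.

```python
import codecs

# Two bytes that together form a single GB2312 character. The first byte,
# 0xbd, is the one reported in the traceback above.
raw = b"\xbd\xf1"

# Hypothetical single-byte table: a charmap codec looks up ONE byte at a
# time, so it can only ever describe 256 code positions.
decoding_map = {i: i for i in range(128)}  # ASCII only, for illustration

try:
    codecs.charmap_decode(raw, "strict", decoding_map)
    charmap_error = None
except UnicodeDecodeError as exc:
    charmap_error = str(exc)

# A multi-byte codec consumes both bytes as one unit.
text = raw.decode("gb2312")

print(charmap_error)
print(len(raw), "bytes ->", len(text), "character")
```

The charmap lookup fails on the very first byte because 0xbd has no
entry of its own; it only has meaning as the lead byte of a two-byte
sequence, which a byte-to-character table cannot express.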

You should probably try one of the CJK codec packages
available for Python, e.g.

"""
  http://sourceforge.net/project/showfiles.php?group_id=46747


The CJKCodecs is a unified unicode codec set for Chinese, Japanese
and Korean encodings. It supports full features of unicode codec
specification and PEP293 error callbacks on Python 2.3.

Currently supported encodings and planned updates:

Authority       0.9             1.0             1.1             1.2
==============================================================================
China (PRC)     gb2312                          iso-2022-cn
                gbk(cp936)                      iso-2022-cn-ext
                gb18030
                hz

Hong Kong                                                       hkscs

Japan           shift-jis       iso-2022-jp-2   euc-jisx0213    iso-2022-int-1
                euc-jp                          shift-jisx0213  mac_japanese
                cp932                           iso-2022-jp-3
                iso-2022-jp
                iso-2022-jp-1

Korea (ROK)     euc-kr                          (ksx1001:2002)  mac_korean
                cp949(uhc)                                      unijohab
                johab
                iso-2022-kr

Korea (DPRK)                                                    euc-kp

Taiwan          big5                            iso-2022-cn
                cp950                           iso-2022-cn-ext
                                                euc-tw

Unicode.org     utf-8           utf-7
                                utf-16

"""

-- 
Marc-Andre Lemburg
eGenix.com
