cp936 uses gbk codec, doesn't decode `\x80` as U+20AC EURO SIGN

John Machin sjmachin at lexicon.net
Sun Oct 10 17:15:50 EDT 2010


|>>> '\x80'.decode('cp936')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 0: incomplete multibyte sequence

However:

Retrieved 2010-10-10 from
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT

    #    Name:     cp936 to Unicode table
    #    Unicode version: 2.0
    #    Table version: 2.01
    #    Table format:  Format A
    #    Date:          1/7/2000
    #
    #    Contact:       Shawn.Steele at microsoft.com
    ...
    0x7F    0x007F  #DELETE
    0x80    0x20AC  #EURO SIGN
    0x81            #DBCS LEAD BYTE

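For anyone who wants to check a table like this against the codec, here is a rough sketch. It assumes Python 3 spelling (bytes literals, `int.to_bytes`), and the names `SAMPLE`, `parse_format_a` and `codec_disagreements` are made up for illustration. It parses only the three sample lines quoted above rather than the whole file, and it assumes a "Format A" line is a hex byte value, a hex code point, then a comment.

```python
import re

# Sample lines copied from the CP936.TXT excerpt above. A real run would
# read the whole downloaded file instead of this hypothetical SAMPLE.
SAMPLE = """\
0x7F    0x007F  #DELETE
0x80    0x20AC  #EURO SIGN
0x81            #DBCS LEAD BYTE
"""

# Assumed line shape: encoded value, Unicode code point, optional comment.
LINE = re.compile(r'^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)')


def parse_format_a(text):
    """Return {encoded value: code point}; lines with no mapping are skipped."""
    table = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            table[int(m.group(1), 16)] = int(m.group(2), 16)
    return table


def codec_disagreements(table, encoding='cp936'):
    """List encoded values that the codec rejects or maps differently."""
    bad = []
    for value, codepoint in sorted(table.items()):
        # One byte for single-byte entries, two bytes (big-endian) otherwise.
        raw = value.to_bytes(2 if value > 0xFF else 1, 'big')
        try:
            ok = raw.decode(encoding) == chr(codepoint)
        except UnicodeDecodeError:
            ok = False
        if not ok:
            bad.append(value)
    return bad


table = parse_format_a(SAMPLE)
mismatches = codec_disagreements(table)
print([hex(v) for v in mismatches])
```

On the sample lines, only the 0x80 entry should be reported, since 0x7F decodes as expected and the lead-byte line carries no mapping.
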
Retrieved 2010-10-10 from
http://msdn.microsoft.com/en-us/goglobal/cc305153.aspx

    Windows Codepage 936
    [pictorial mapping; shows 80 mapping to 20AC]

Retrieved 2010-10-10 from
http://demo.icu-project.org/icu-bin/convexp?conv=windows-936-2000&s=ALL

    [pictorial mapping for converter "windows-936-2000" with aliases
    including GBK, CP936, MS936; shows 80 mapping to 20AC]

So Microsoft appears to think that cp936 includes the euro,
and the ICU project seems to think that GBK and cp936
both include the euro.
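
In the meantime, one possible workaround is a custom error handler that supplies the euro when the codec chokes on a 0x80 in lead-byte position. This is only a sketch in Python 3 spelling, not anything the codec itself provides: the handler name 'cp936-euro' and the helper `decode_cp936` are invented here, and it assumes that treating a stray 0x80 as U+20AC is what you want.

```python
import codecs


def _euro_for_0x80(exc):
    """Error handler: map an undecodable 0x80 byte to U+20AC EURO SIGN."""
    if (isinstance(exc, UnicodeDecodeError)
            and exc.object[exc.start] == 0x80):
        # Replace just that one byte and resume decoding right after it.
        return ('\u20ac', exc.start + 1)
    # Anything else is still an error, as with errors='strict'.
    raise exc


# 'cp936-euro' is a hypothetical handler name chosen for this sketch.
codecs.register_error('cp936-euro', _euro_for_0x80)


def decode_cp936(data):
    """Decode bytes as cp936, treating a lone 0x80 as the euro sign."""
    return data.decode('cp936', errors='cp936-euro')


print(ascii(decode_cp936(b'a\x80b')))
```

A 0x80 that appears as the trail byte of a valid two-byte sequence never reaches the handler, so those characters decode as they do now.
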

A couple of questions:

Is this a bug or a shrug?

Where can one find the mapping tables
from which the various CJK codecs are derived?






