[I18n-sig] Re: [XML-SIG] Character encodings and expat

M.-A. Lemburg mal@lemburg.com
Mon, 30 Oct 2000 12:44:14 +0100


Andy Robinson wrote:
> 
> > The Asian codecs were just left out of the standard dist due
> > to size problems.
> 
> ...and also due to not all being written yet :-)

Well, we could have included Tamito's codecs, but the general
consensus was not to, because of the size of the mapping tables.

I think that we ought to start a project for implementing
the AsianCodecs package.

I'll look into wrapping the C lib iconv interface into a
codec package... provided I find some time :-(

I've had a look at the IANA character set registry 
(http://www.isi.edu/in-notes/iana/assignments/character-sets)
and compared the info to what we already have in Python 2.0. 
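
For anyone who wants to repeat that kind of comparison, here is a
minimal sketch. It assumes a current Python 3 interpreter and is not
the script used to produce the list in this message. The helper name
missing_codecs and the made-up charset name are hypothetical; only
codecs.lookup() is a real library call.

```python
import codecs

def missing_codecs(names):
    """Return the names for which codecs.lookup() finds no codec."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            # No registered search function knows this charset name.
            missing.append(name)
    return missing

# Two well-known names plus one deliberately made-up name.
sample = ["ISO-8859-1", "UTF-8", "no-such-charset-xyz"]
print(missing_codecs(sample))
```

The result depends on the Python version in use, since codecs have
been added to the standard distribution over time.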

Here is a list of codecs which are not present in Python 2.0. It
would be nice if someone with access to the various sources could help
in putting together a few charmap codecs for these in case they
are really needed (I think some EBCDIC codecs would be helpful for
conversion of host data files)...
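
To give an idea of how little code a charmap codec needs once the
mapping table is available, here is a rough sketch for a current
Python 3. The table is a toy (Latin-1 with two byte values swapped)
and does NOT correspond to any charset in the list; a real codec
would fill in all 256 entries from the reference documents. The
function names decode and encode are just for illustration.

```python
import codecs

# Toy decoding table (not a real charset): start from the Latin-1
# identity mapping, then swap two entries so the table stays one-to-one.
table = [chr(i) for i in range(256)]
table[0xC1] = "A"        # byte 0xC1 decodes to "A" (illustrative only)
table[0x41] = "\u00c1"   # byte 0x41 takes over the displaced character
decoding_table = "".join(table)

# Build the reverse (encoding) map from the decoding table.
encoding_table = codecs.charmap_build(decoding_table)

def decode(data, errors="strict"):
    """Decode bytes to str using the toy table; returns (text, length)."""
    return codecs.charmap_decode(data, errors, decoding_table)

def encode(text, errors="strict"):
    """Encode str to bytes using the toy table; returns (data, length)."""
    return codecs.charmap_encode(text, errors, encoding_table)

text, consumed = decode(b"\xc1BC")
print(text, consumed)
```

Making such a codec available under a name would additionally need a
Codec class and a registry entry, which is left out here.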

Missing Codecs:
------------------------------------------------------------------------
ISO-2022-KR                    : RFC-1557 (see also KS_C_5601-1987)
IBM00858                       : IBM See (.../assignments/character-set-info/IBM00858) [Mahdi]
DEC-MCS                        : VAX/VMS User's Manual,
EBCDIC-UK                      : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
ISO-2022-CN                    : RFC-1922
MNEM                           : RFC 1345, also known as "mnemonic+ascii+8200"
T.101-G2                       : ECMA registry
KOI8-U                         : RFC 2319
IBM880                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
Windows-31J                    : Windows Japanese. A further extension of Shift_JIS
ISO_5427:1981                  : ECMA registry
JUS_I.B1.003-mac               : ECMA registry
ISO-8859-2-Windows-Latin-2     : Extended ISO 8859-2. Latin-2 for Windows 3.1.
Adobe-Symbol-Encoding          : PostScript Language Reference Manual
IBM275                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
IT                             : ECMA registry
EBCDIC-AT-DE-A                 : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
GB_1988-80                     : ECMA registry
DS_2089                        : Danish Standard, DS 2089, February 1974
ISO-10646-UCS-Basic            : ASCII subset of Unicode. Basic Latin = collection 1
EBCDIC-CA-FR                   : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
TIS-620                        : Thai Industrial Standards Institute (TISI) [Tantsetthi]
IBM-Symbols                    : Presentation Set, CPGID: 259
MNEMONIC                       : RFC 1345, also known as "mnemonic+ascii+38"
CSA_Z243.4-1985-2              : ECMA registry
ISO-8859-9-Windows-Latin-5     : Extended ISO 8859-9. Latin-5 for Windows 3.1
ISO-2022-JP                    : RFC-1468 (see also RFC-2237)
GOST_19768-74                  : ECMA registry
DIN_66003                      : ECMA registry
EBCDIC-FR                      : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
ASMO_449                       : ECMA registry
ISO-Unicode-IBM-1276           : IBM Cyrillic Greek Extended Presentation Set, GCSGID: 1276
latin-greek                    : ECMA registry
HZ-GB-2312                     : RFC 1842, RFC 1843 [RFC1842, RFC1843]
Big5-HKSCS                     : See (.../assignments/character-set-info/Big5-HKSCS)
ISO-10646-UCS-4                : the full code space. (same comment about byte order,
ISO-10646-UTF-1                : Universal Transfer Format (1), this is the multibyte
ISO-10646-UCS-2                : the 2-octet Basic Multilingual Plane, aka Unicode
CSA_Z243.4-1985-gr             : ECMA registry
latin-lap                      : ECMA registry
EBCDIC-ES                      : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
UNKNOWN-8BIT                   : 
EBCDIC-FI-SE                   : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
BS_4730                        : ECMA registry
IBM290                         : IBM 3174 Character Set Ref, GA27-3831-02, March 1990
IBM420                         : IBM NLS RM Vol2 SE09-8002-01, March 1990,
JIS_Encoding                   : JIS X 0202-1991. Uses ISO 2022 escape sequences to
T.61-8bit                      : ECMA registry
ISO-2022-CN-EXT                : RFC-1922
Microsoft-Publishing           : PCL 5 Comparison Guide, Hewlett-Packard,
ISO-2022-JP-2                  : RFC-1554
ISO_5428:1980                  : ECMA registry
Ventura-Math                   : PCL 5 Comparison Guide, Hewlett-Packard,
EBCDIC-ES-S                    : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
windows-1251                   : Microsoft (see ../character-set-info/windows-1251) [Lazhintseva]
windows-1250                   : Microsoft (see ../character-set-info/windows-1250) [Lazhintseva]
windows-1253                   : Microsoft (see ../character-set-info/windows-1253) [Lazhintseva]
windows-1252                   : Microsoft (see ../character-set-info/windows-1252) [Wendt]
windows-1255                   : Microsoft (see ../character-set-info/windows-1255) [Lazhintseva]
windows-1254                   : Microsoft (see ../character-set-info/windows-1254) [Lazhintseva]
windows-1257                   : Microsoft (see ../character-set-info/windows-1257) [Lazhintseva]
windows-1256                   : Microsoft (see ../character-set-info/windows-1256) [Lazhintseva]
windows-1258                   : Microsoft (see ../character-set-info/windows-1258) [Lazhintseva]
JUS_I.B1.002                   : ECMA registry
ISO_8859-8-I                   : RFC-1556
CSA_Z243.4-1985-1              : ECMA registry
JIS_X0212-1990                 : ECMA registry
ISO_5427                       : ECMA registry
ISO_6937-2-add                 : ECMA registry and ISO 6937-2:1983
ISO_8859-8-E                   : RFC-1556
BS_viewdata                    : ECMA registry
IBM281                         : IBM 3174 Character Set Ref, GA27-3831-02, March 1990
IBM280                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
IBM285                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
IBM284                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
Adobe-Standard-Encoding        : PostScript Language Reference Manual
ISO_646.irv:1983               : ECMA registry
GB2312                         : Chinese for People's Republic of China (PRC) mixed one byte,
Extended_UNIX_Code_Fixed_Width_for_Japanese : Used in Japan. Each character is 2 octets.
SEN_850200_B                   : ECMA registry
SEN_850200_C                   : ECMA registry
Ventura-International          : Ventura International. ASCII plus coded characters similar
ISO-Unicode-IBM-1265           : IBM Hebrew Presentation Set, GCSGID: 1265
ISO-Unicode-IBM-1264           : IBM Arabic Presentation Set, GCSGID: 1264
ISO-Unicode-IBM-1261           : IBM Latin-2, -3, -5, Extended Presentation Set, GCSGID: 1261
IBM851                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
PC8-Turkish                    : PC Latin Turkish. PCL Symbol Set id: 9T
ISO_8859-supp                  : ECMA registry
ISO-Unicode-IBM-1268           : IBM Latin-4 Extended Presentation Set, GCSGID: 1268
EBCDIC-ES-A                    : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
ISO-8859-1-Windows-3.0-Latin-1 : Extended ISO 8859-1 Latin-1 for Windows 3.0.
IBM01149                       : IBM See (.../assignments/character-set-info/IBM01149) [Mahdi]
ECMA-cyrillic                  : ECMA registry
IBM01147                       : IBM See (.../assignments/character-set-info/IBM01147) [Mahdi]
NATS-DANO-ADD                  : ECMA registry
IBM01145                       : IBM See (.../assignments/character-set-info/IBM01145) [Mahdi]
IBM01144                       : IBM See (.../assignments/character-set-info/IBM01144) [Mahdi]
IBM01143                       : IBM See (.../assignments/character-set-info/IBM01143) [Mahdi]
IBM01141                       : IBM See (.../assignments/character-set-info/IBM01141) [Mahdi]
IBM01140                       : IBM See (.../assignments/character-set-info/IBM01140) [Mahdi]
macintosh                      : The Unicode Standard ver1.0, ISBN 0-201-56788-1, Oct 1991
IBM278                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
NS_4551-2                      : ECMA registry
IBM274                         : IBM 3174 Character Set Ref, GA27-3831-02, March 1990
NS_4551-1                      : ECMA registry
JIS_C6226-1983                 : ECMA registry
ANSI_X3.110-1983               : ECMA registry
IBM273                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
JIS_C6229-1984-b               : ECMA registry
greek7                         : ECMA registry
EUC-KR                         : RFC-1557 (see also KS_C_5861-1992)
NF_Z_62-010                    : ECMA registry
JIS_X0201                      : JIS X 0201-1976. One byte only, this is equivalent to
IBM01146                       : IBM See (.../assignments/character-set-info/IBM01146) [Mahdi]
IBM01148                       : IBM See (.../assignments/character-set-info/IBM01148) [Mahdi]
ES                             : ECMA registry
PT2                            : ECMA registry
INIS-cyrillic                  : ECMA registry
NF_Z_62-010_(1973)             : ECMA registry
greek-ccitt                    : ECMA registry
EBCDIC-AT-DE                   : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
JIS_C6229-1984-b-add           : ECMA registry
Big5                           : Chinese for Taiwan Multi-byte set.
MSZ_7795.3                     : ECMA registry
JIS_C6220-1969-ro              : ECMA registry
videotex-suppl                 : ECMA registry
HP-Math8                       : PCL 5 Comparison Guide, Hewlett-Packard,
IBM01142                       : IBM See (.../assignments/character-set-info/IBM01142) [Mahdi]
HP-DeskTop                     : PCL 5 Comparison Guide, Hewlett-Packard,
ISO_8859-6-I                   : RFC-1556
IBM00924                       : IBM See (.../assignments/character-set-info/IBM00924) [Mahdi]
JIS_C6229-1984-kana            : ECMA registry
IBM277                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
JUS_I.B1.003-serb              : ECMA registry
IBM870                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
IBM871                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
EBCDIC-FI-SE-A                 : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
IBM903                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
IBM904                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
VIQR                           : RFC 1456
EBCDIC-PT                      : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
JIS_C6220-1969-jp              : ECMA registry
ISO_10367-box                  : ECMA registry
JIS_C6229-1984-hand-add        : ECMA registry
PC8-Danish-Norwegian           : PC Danish Norwegian
KS_C_5601-1987                 : ECMA registry
iso-ir-90                      : ECMA registry
greek7-old                     : ECMA registry
us-dk                          : 
ISO-8859-1-Windows-3.1-Latin-1 : Extended ISO 8859-1 Latin-1 for Windows 3.1.
IBM918                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
hp-roman8                      : LaserJet IIP Printer User's Manual,
IBM905                         : IBM 3174 Character Set Ref, GA27-3831-02, March 1990
ISO_2033-1983                  : ECMA registry
IBM-Thai                       : Presentation Set, CPGID: 838
NATS-DANO                      : ECMA registry
IBM868                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
IBM297                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
Latin-greek-1                  : ECMA registry
EBCDIC-US                      : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
IBM423                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
ISO_6937-2-25                  : ECMA registry
ES2                            : ECMA registry
NATS-SEFI                      : ECMA registry
KSC5636                        : 
ISO-10646-Unicode-Latin1       : ISO Latin-1 subset of Unicode. Basic Latin and Latin-1
GB_2312-80                     : ECMA registry
HP-Legal                       : PCL 5 Comparison Guide, Hewlett-Packard,
ISO_8859-6-E                   : RFC-1556
Extended_UNIX_Code_Packed_Format_for_Japanese : Standardized by OSF, UNIX International, and UNIX Systems
ISO_646.basic:1983             : ECMA registry
INIS-8                         : ECMA registry
JIS_C6229-1984-hand            : ECMA registry
EBCDIC-DK-NO                   : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
VISCII                         : RFC 1456
INIS                           : ECMA registry
PT                             : ECMA registry
Ventura-US                     : Ventura US. ASCII plus characters typically used in
CSN_369103                     : ECMA registry
JIS_C6226-1978                 : ECMA registry
IBM891                         : IBM NLS RM Vol2 SE09-8002-01, March 1990
dk-us                          : 
EBCDIC-IT                      : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
UNICODE-1-1                    : RFC 1641
UNICODE-1-1-UTF-7              : RFC 1642
JIS_C6229-1984-a               : ECMA registry
INVARIANT                      : 
HP-Pi-font                     : PCL 5 Comparison Guide, Hewlett-Packard,
NATS-SEFI-ADD                  : ECMA registry
IBM038                         : IBM 3174 Character Set Ref, GA27-3831-02, March 1990
T.61-7bit                      : ECMA registry
IEC_P27-1                      : ECMA registry
ISO-10646-J-1                  : ISO 10646 Japanese, see RFC 1815.
Shift_JIS                      : This charset is an extension of csHalfWidthKatakana by
EBCDIC-DK-NO-A                 : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
NC_NC00-10:81                  : ECMA registry

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/