[I18n-sig] Re: [XML-SIG] Character encodings and expat
M.-A. Lemburg
mal@lemburg.com
Mon, 30 Oct 2000 12:44:14 +0100
Andy Robinson wrote:
>
> > The Asian codecs were just left out of the standard dist due
> > to size problems.
>
> ...and also due to not all being written yet :-)
Well, we could have included Tamito's codecs, but the general
consensus was not to, because of the size of the mapping tables.
I think that we ought to start a project for implementing
the AsianCodecs package.
I'll look into wrapping the C library's iconv interface in a
codec package... provided I find some time :-(
I've had a look at the IANA character set registry
(http://www.isi.edu/in-notes/iana/assignments/character-sets)
and compared the info to what we already have in Python 2.0.
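For anyone who wants to repeat this kind of comparison, here is a minimal sketch. It assumes the registry names have already been collected into a plain list; the helper name missing_codecs and the sample names are only illustrative, and the code is written for a current Python rather than 2.0. The result depends on which codecs the running interpreter ships.

```python
import codecs

def missing_codecs(names):
    """Return the charset names for which codecs.lookup() finds no codec."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            # No registered search function knows this name.
            missing.append(name)
    return missing

# Illustrative sample only; the last name is deliberately bogus.
sample = ["ascii", "ISO-8859-1", "no-such-charset-xyz"]
print(missing_codecs(sample))
```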
Here is a list of codecs which are not present in Python 2.0. It
would be nice if someone with access to the various sources could help
in putting together a few charmap codecs for these in case they
are really needed (I think some EBCDIC codecs would be helpful for
conversion of host data files)...
Missing Codecs:
------------------------------------------------------------------------
ISO-2022-KR : RFC-1557 (see also KS_C_5601-1987)
IBM00858 : IBM See (.../assignments/character-set-info/IBM00858) [Mahdi]
DEC-MCS : VAX/VMS User's Manual,
EBCDIC-UK : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
ISO-2022-CN : RFC-1922
MNEM : RFC 1345, also known as "mnemonic+ascii+8200"
T.101-G2 : ECMA registry
KOI8-U : RFC 2319
IBM880 : IBM NLS RM Vol2 SE09-8002-01, March 1990
Windows-31J : Windows Japanese. A further extension of Shift_JIS
ISO_5427:1981 : ECMA registry
JUS_I.B1.003-mac : ECMA registry
ISO-8859-2-Windows-Latin-2 : Extended ISO 8859-2. Latin-2 for Windows 3.1.
Adobe-Symbol-Encoding : PostScript Language Reference Manual
IBM275 : IBM NLS RM Vol2 SE09-8002-01, March 1990
IT : ECMA registry
EBCDIC-AT-DE-A : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
GB_1988-80 : ECMA registry
DS_2089 : Danish Standard, DS 2089, February 1974
ISO-10646-UCS-Basic : ASCII subset of Unicode. Basic Latin = collection 1
EBCDIC-CA-FR : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
TIS-620 : Thai Industrial Standards Institute (TISI) [Tantsetthi]
IBM-Symbols : Presentation Set, CPGID: 259
MNEMONIC : RFC 1345, also known as "mnemonic+ascii+38"
CSA_Z243.4-1985-2 : ECMA registry
ISO-8859-9-Windows-Latin-5 : Extended ISO 8859-9. Latin-5 for Windows 3.1
ISO-2022-JP : RFC-1468 (see also RFC-2237)
GOST_19768-74 : ECMA registry
DIN_66003 : ECMA registry
EBCDIC-FR : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
ASMO_449 : ECMA registry
ISO-Unicode-IBM-1276 : IBM Cyrillic Greek Extended Presentation Set, GCSGID: 1276
latin-greek : ECMA registry
HZ-GB-2312 : RFC 1842, RFC 1843 [RFC1842, RFC1843]
Big5-HKSCS : See (.../assignments/character-set-info/Big5-HKSCS)
ISO-10646-UCS-4 : the full code space. (same comment about byte order,
ISO-10646-UTF-1 : Universal Transfer Format (1), this is the multibyte
ISO-10646-UCS-2 : the 2-octet Basic Multilingual Plane, aka Unicode
CSA_Z243.4-1985-gr : ECMA registry
latin-lap : ECMA registry
EBCDIC-ES : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
UNKNOWN-8BIT :
EBCDIC-FI-SE : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
BS_4730 : ECMA registry
IBM290 : IBM 3174 Character Set Ref, GA27-3831-02, March 1990
IBM420 : IBM NLS RM Vol2 SE09-8002-01, March 1990,
JIS_Encoding : JIS X 0202-1991. Uses ISO 2022 escape sequences to
T.61-8bit : ECMA registry
ISO-2022-CN-EXT : RFC-1922
Microsoft-Publishing : PCL 5 Comparison Guide, Hewlett-Packard,
ISO-2022-JP-2 : RFC-1554
ISO_5428:1980 : ECMA registry
Ventura-Math : PCL 5 Comparison Guide, Hewlett-Packard,
EBCDIC-ES-S : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
windows-1251 : Microsoft (see ../character-set-info/windows-1251) [Lazhintseva]
windows-1250 : Microsoft (see ../character-set-info/windows-1250) [Lazhintseva]
windows-1253 : Microsoft (see ../character-set-info/windows-1253) [Lazhintseva]
windows-1252 : Microsoft (see ../character-set-info/windows-1252) [Wendt]
windows-1255 : Microsoft (see ../character-set-info/windows-1255) [Lazhintseva]
windows-1254 : Microsoft (see ../character-set-info/windows-1254) [Lazhintseva]
windows-1257 : Microsoft (see ../character-set-info/windows-1257) [Lazhintseva]
windows-1256 : Microsoft (see ../character-set-info/windows-1256) [Lazhintseva]
windows-1258 : Microsoft (see ../character-set-info/windows-1258) [Lazhintseva]
JUS_I.B1.002 : ECMA registry
ISO_8859-8-I : RFC-1556
CSA_Z243.4-1985-1 : ECMA registry
JIS_X0212-1990 : ECMA registry
ISO_5427 : ECMA registry
ISO_6937-2-add : ECMA registry and ISO 6937-2:1983
ISO_8859-8-E : RFC-1556
BS_viewdata : ECMA registry
IBM281 : IBM 3174 Character Set Ref, GA27-3831-02, March 1990
IBM280 : IBM NLS RM Vol2 SE09-8002-01, March 1990
IBM285 : IBM NLS RM Vol2 SE09-8002-01, March 1990
IBM284 : IBM NLS RM Vol2 SE09-8002-01, March 1990
Adobe-Standard-Encoding : PostScript Language Reference Manual
ISO_646.irv:1983 : ECMA registry
GB2312 : Chinese for People's Republic of China (PRC) mixed one byte,
Extended_UNIX_Code_Fixed_Width_for_Japanese : Used in Japan. Each character is 2 octets.
SEN_850200_B : ECMA registry
SEN_850200_C : ECMA registry
Ventura-International : Ventura International. ASCII plus coded characters similar
ISO-Unicode-IBM-1265 : IBM Hebrew Presentation Set, GCSGID: 1265
ISO-Unicode-IBM-1264 : IBM Arabic Presentation Set, GCSGID: 1264
ISO-Unicode-IBM-1261 : IBM Latin-2, -3, -5, Extended Presentation Set, GCSGID: 1261
IBM851 : IBM NLS RM Vol2 SE09-8002-01, March 1990
PC8-Turkish : PC Latin Turkish. PCL Symbol Set id: 9T
ISO_8859-supp : ECMA registry
ISO-Unicode-IBM-1268 : IBM Latin-4 Extended Presentation Set, GCSGID: 1268
EBCDIC-ES-A : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
ISO-8859-1-Windows-3.0-Latin-1 : Extended ISO 8859-1 Latin-1 for Windows 3.0.
IBM01149 : IBM See (.../assignments/character-set-info/IBM01149) [Mahdi]
ECMA-cyrillic : ECMA registry
IBM01147 : IBM See (.../assignments/character-set-info/IBM01147) [Mahdi]
NATS-DANO-ADD : ECMA registry
IBM01145 : IBM See (.../assignments/character-set-info/IBM01145) [Mahdi]
IBM01144 : IBM See (.../assignments/character-set-info/IBM01144) [Mahdi]
IBM01143 : IBM See (.../assignments/character-set-info/IBM01143) [Mahdi]
IBM01141 : IBM See (.../assignments/character-set-info/IBM01141) [Mahdi]
IBM01140 : IBM See (.../assignments/character-set-info/IBM01140) [Mahdi]
macintosh : The Unicode Standard ver1.0, ISBN 0-201-56788-1, Oct 1991
IBM278 : IBM NLS RM Vol2 SE09-8002-01, March 1990
NS_4551-2 : ECMA registry
IBM274 : IBM 3174 Character Set Ref, GA27-3831-02, March 1990
NS_4551-1 : ECMA registry
JIS_C6226-1983 : ECMA registry
ANSI_X3.110-1983 : ECMA registry
IBM273 : IBM NLS RM Vol2 SE09-8002-01, March 1990
JIS_C6229-1984-b : ECMA registry
greek7 : ECMA registry
EUC-KR : RFC-1557 (see also KS_C_5861-1992)
NF_Z_62-010 : ECMA registry
JIS_X0201 : JIS X 0201-1976. One byte only, this is equivalent to
IBM01146 : IBM See (.../assignments/character-set-info/IBM01146) [Mahdi]
IBM01148 : IBM See (.../assignments/character-set-info/IBM01148) [Mahdi]
ES : ECMA registry
PT2 : ECMA registry
INIS-cyrillic : ECMA registry
NF_Z_62-010_(1973) : ECMA registry
greek-ccitt : ECMA registry
EBCDIC-AT-DE : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
JIS_C6229-1984-b-add : ECMA registry
Big5 : Chinese for Taiwan Multi-byte set.
MSZ_7795.3 : ECMA registry
JIS_C6220-1969-ro : ECMA registry
videotex-suppl : ECMA registry
HP-Math8 : PCL 5 Comparison Guide, Hewlett-Packard,
IBM01142 : IBM See (.../assignments/character-set-info/IBM01142) [Mahdi]
HP-DeskTop : PCL 5 Comparison Guide, Hewlett-Packard,
ISO_8859-6-I : RFC-1556
IBM00924 : IBM See (.../assignments/character-set-info/IBM00924) [Mahdi]
JIS_C6229-1984-kana : ECMA registry
IBM277 : IBM NLS RM Vol2 SE09-8002-01, March 1990
JUS_I.B1.003-serb : ECMA registry
IBM870 : IBM NLS RM Vol2 SE09-8002-01, March 1990
IBM871 : IBM NLS RM Vol2 SE09-8002-01, March 1990
EBCDIC-FI-SE-A : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
IBM903 : IBM NLS RM Vol2 SE09-8002-01, March 1990
IBM904 : IBM NLS RM Vol2 SE09-8002-01, March 1990
VIQR : RFC 1456
EBCDIC-PT : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
JIS_C6220-1969-jp : ECMA registry
ISO_10367-box : ECMA registry
JIS_C6229-1984-hand-add : ECMA registry
PC8-Danish-Norwegian : PC Danish Norwegian
KS_C_5601-1987 : ECMA registry
iso-ir-90 : ECMA registry
greek7-old : ECMA registry
us-dk :
ISO-8859-1-Windows-3.1-Latin-1 : Extended ISO 8859-1 Latin-1 for Windows 3.1.
IBM918 : IBM NLS RM Vol2 SE09-8002-01, March 1990
hp-roman8 : LaserJet IIP Printer User's Manual,
IBM905 : IBM 3174 Character Set Ref, GA27-3831-02, March 1990
ISO_2033-1983 : ECMA registry
IBM-Thai : Presentation Set, CPGID: 838
NATS-DANO : ECMA registry
IBM868 : IBM NLS RM Vol2 SE09-8002-01, March 1990
IBM297 : IBM NLS RM Vol2 SE09-8002-01, March 1990
Latin-greek-1 : ECMA registry
EBCDIC-US : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
IBM423 : IBM NLS RM Vol2 SE09-8002-01, March 1990
ISO_6937-2-25 : ECMA registry
ES2 : ECMA registry
NATS-SEFI : ECMA registry
KSC5636 :
ISO-10646-Unicode-Latin1 : ISO Latin-1 subset of Unicode. Basic Latin and Latin-1
GB_2312-80 : ECMA registry
HP-Legal : PCL 5 Comparison Guide, Hewlett-Packard,
ISO_8859-6-E : RFC-1556
Extended_UNIX_Code_Packed_Format_for_Japanese : Standardized by OSF, UNIX International, and UNIX Systems
ISO_646.basic:1983 : ECMA registry
INIS-8 : ECMA registry
JIS_C6229-1984-hand : ECMA registry
EBCDIC-DK-NO : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
VISCII : RFC 1456
INIS : ECMA registry
PT : ECMA registry
Ventura-US : Ventura US. ASCII plus characters typically used in
CSN_369103 : ECMA registry
JIS_C6226-1978 : ECMA registry
IBM891 : IBM NLS RM Vol2 SE09-8002-01, March 1990
dk-us :
EBCDIC-IT : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
UNICODE-1-1 : RFC 1641
UNICODE-1-1-UTF-7 : RFC 1642
JIS_C6229-1984-a : ECMA registry
INVARIANT :
HP-Pi-font : PCL 5 Comparison Guide, Hewlett-Packard,
NATS-SEFI-ADD : ECMA registry
IBM038 : IBM 3174 Character Set Ref, GA27-3831-02, March 1990
T.61-7bit : ECMA registry
IEC_P27-1 : ECMA registry
ISO-10646-J-1 : ISO 10646 Japanese, see RFC 1815.
Shift_JIS : This charset is an extension of csHalfWidthKatakana by
EBCDIC-DK-NO-A : IBM 3270 Char Set Ref Ch 10, GA27-2837-9, April 1987
NC_NC00-10:81 : ECMA registry
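To give an idea of how little code a charmap codec for one of the single-byte entries above needs, here is a rough sketch for a current Python. The mapping table is a made-up toy (ASCII plus one extra character), not taken from any of the listed standards, and the codec name "toycharmap" is hypothetical; a real codec would fill in the 256-entry table from the relevant source document.

```python
import codecs

# Toy table: bytes 0x00-0x7F map to ASCII, 0x80 maps to the euro sign,
# and every other byte is undefined (U+FFFE marks "undefined" entries).
decoding_table = (
    "".join(chr(i) for i in range(128))
    + "\u20ac"
    + "\ufffe" * 127
)
# Derive the reverse (Unicode -> byte) mapping from the decoding table.
encoding_table = codecs.charmap_build(decoding_table)

def toy_encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, encoding_table)

def toy_decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, decoding_table)

def search(name):
    # Hypothetical codec name, chosen only for this example.
    if name == "toycharmap":
        return codecs.CodecInfo(
            name="toycharmap", encode=toy_encode, decode=toy_decode
        )
    return None

codecs.register(search)
```

Once the search function is registered, "A\u20ac".encode("toycharmap") and b"A\x80".decode("toycharmap") go through the table like any built-in codec, and bytes marked undefined raise a decoding error in strict mode.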
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/