[Python-ideas] [issue33865] [EASY] Missing code page aliases: "unknown encoding: 874"

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Sun Jun 17 08:02:02 EDT 2018

Folks.  There are standards.  "1252" *is not* an alias for
"windows-1252" according to the IANA, while "866" *is* an alias for
"IBM866" according to the same authority.  Most 3-digit "IBMxxx" ARE
aliased to both "cpxxx" and just "xxx", but not all.  None of
"IBM874", "874", or "cp874" exists according to the IANA.


For the reasons Steven gave, I would say omit the digits-only aliases,
but if we must use them because "there's a standard" (or backward
compatibility), we should stick to those defined by standard, and only
those.

If we're following other standards that I'm unaware of, fine, but
let's cite them rather than randomly introduce a plethora of aliases
because they "look like" an existing (and unfortunate) standard.

There's also some other weirdness with "windows-874", see below.  We
(somebody) should check other "windows-xxx" character sets to make
sure they're not misnamed "cpxxx".

Steven D'Aprano writes:
 > > It is easy to test it. Encoding/decoding with '874' should give the 
 > > same result as with 'cp874'.
 > I know it is too late to remove that feature, but why do we support 
 > digit-only IDs for encodings? They can be ambiguous. If Wikipedia is 
 > correct, cp874 (also known as ibm874) and Windows-874 (also known as 
 > cp1162) are different:

According to the IANA, they're not necessarily ambiguous.  Here is
the entry for IBM866:

IBM866  2086  IBM NLDG Volume 2            cp866
              (SE09-8002-03) August 1994   866
              [Rick_Pond]                  csIBM866

where the entries in column 4 show the registered aliases.  There are
at least a dozen IBMxxx character sets with 'xxx' aliases.
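
For anyone who wants to see how Python treats the registered IBM866
aliases, here is a minimal sketch using only the stdlib codecs.lookup().
The helper name resolve() is made up for illustration, and the result
reflects Python's own alias table, not the IANA registry.

```python
import codecs

def resolve(name):
    """Return Python's canonical codec name for `name`, or None if unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Python has no codec or alias registered under this name
        return None

# The name and aliases taken from the IANA entry for IBM866 quoted above
for alias in ("IBM866", "cp866", "866", "csIBM866"):
    print(f"{alias!r:12} -> {resolve(alias)!r}")
```

The same helper can be pointed at any other name to see whether Python
accepts it at all, without having to catch the exception at each call.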

I don't understand what's with "cp874", though.  We can surely take
that one back, although we'd better hurry if it's in 3.7rc.  We might
want to add "windows-874" (which doesn't seem to be present in Python
3.6), since that's the standard character set name per IANA.
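
A rough way to see which of these names a given Python build accepts,
and whether any two of them actually decode the same way, is sketched
below.  The helpers known() and same_mapping() are hypothetical names,
the comparison only makes sense for single-byte code pages, and which
names are recognized will depend on the Python version.

```python
import codecs

def known(name):
    """True if this Python build can look up a codec under `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

def same_mapping(name_a, name_b):
    """True if two single-byte codecs decode bytes 0x00-0xFF identically."""
    data = bytes(range(256))
    # errors="replace" turns undefined bytes into U+FFFD instead of raising
    return (data.decode(name_a, errors="replace")
            == data.decode(name_b, errors="replace"))

candidates = ["cp874", "874", "windows-874", "IBM874"]
available = [n for n in candidates if known(n)]
print("recognized:", available)

# Compare every other recognized name against the first one found
for other in available[1:]:
    print(available[0], "vs", other, "->", same_mapping(available[0], other))
```

This only tells us whether Python's names agree with each other; whether
the underlying table matches the vendor's or the IANA's definition still
has to be checked against those sources.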

The confusion between cp874 and windows-874 may be because in
VENDORS/MICSFT/WINDOWS it's in CP874.TXT (as are all the code pages

 > https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_874
 > https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_1162

I don't know where Wikipedia's information comes from, but it's not
the IANA.

