Mailman 3 Re: [Python-ideas] [issue33865] [EASY] Missing code page aliases: "unknown encoding: 874" - Python-ideas

Stephen J. Turnbull

June 2018

5:02 a.m.

New subject: [issue33865] [EASY] Missing code page aliases: "unknown encoding: 874"

Folks. There are standards. "1252" *is not* an alias for "windows-1252" according to the IANA, while "866" *is* an alias for "IBM866" according to the same authority. Most 3-digit "IBMxxx" ARE aliased to both "cpxxx" and just "xxx", but not all. None of "IBM874", "874", or "cp874" exists according to the IANA.

https://www.iana.org/assignments/character-sets/character-sets.xhtml

For the reasons Steven gave, I would say omit the digits-only aliases, but if we must use them because "there's a standard" (or backward compatibility), we should stick to those defined by standard, and only those. If we're following other standards that I'm unaware of, fine, but let's cite them rather than randomly introduce a plethora of aliases because they "look like" an existing (and unfortunate) standard.

There's also some other weirdness with "windows-874", see below. We (somebody) should check other "windows-xxx" character sets to make sure they're not misnamed "cpxxx".

Steven D'Aprano writes:

...

According to the IANA, they're not necessarily ambiguous. Here is the entry for IBM866:

Name    MIBenum  Source                       Aliases   Reference
IBM866  2086     IBM NLDG Volume 2            cp866
                 (SE09-8002-03) August 1994   866       [Rick_Pond]
                                              csIBM866

where the entries in column 4 show the registered aliases. There are at least a dozen IBMxxx character sets with 'xxx' aliases.

I don't understand what's with "cp874", though. We can surely take that one back, although we'd better hurry if it's in 3.7rc. We might want to add "windows-874" (which doesn't seem to be present in Python 3.6), since that's the standard character set name per IANA. The confusion between cp874 and windows-874 may be because in VENDORS/MICSFT/WINDOWS it's in CP874.TXT (as are all the code pages there).
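One way to see which of these spellings a particular Python build accepts is to ask the codec registry directly. The following is a minimal sketch using only `codecs.lookup`; the helper name `canonical_codec` is hypothetical, and the answer for any given name depends on the Python version in use, so no expected output is shown.

```python
import codecs


def canonical_codec(name):
    """Return the codec name Python resolves `name` to, or None if unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # codecs.lookup raises LookupError for names it cannot resolve.
        return None


if __name__ == "__main__":
    # Spellings discussed in this thread; results vary by Python version.
    for candidate in ("IBM866", "cp866", "866", "cp874", "windows-874", "874"):
        print(f"{candidate!r:>15} -> {canonical_codec(candidate)!r}")
```

Comparing the printed results against the IANA registry shows where Python's alias table and the registered names diverge.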

...

https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_874

https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_1162

I don't know where Wikipedia's information comes from, but it's not the IANA.

-- 
Associate Professor                Division of Policy and Planning Science
http://turnbull.sk.tsukuba.ac.jp/  Faculty of Systems and Information
Email: turnbull@sk.tsukuba.ac.jp   University of Tsukuba
Tel: 029-853-5175                  Tennodai 1-1-1, Tsukuba 305-8573 JAPAN


Steven D'Aprano

5:34 p.m.


...

That doesn't mean that the bug is best fixed by adding an alias. If the error was failing to find encoding "ltain-1", would we add an alias or fix the spelling? If 874 is not an official alias, we should consider it a misspelling and fix the misspelling, not add an alias. But either way, the point Stephen is making is that even if 874 is a legitimate alias, that shouldn't give us carte blanche to add numeric aliases for every encoding.


Stephen J. Turnbull

12:17 a.m.


Ronald Oussoren writes:

...

Almost all of the CPxxx encodings have multiple aliases[1], so I just don't see the point unless numeric-only code page designations are baked into default "locales"[2] in official releases by major OS vendors. And probably not even then, since it should be easy enough to provide a proper "locale" and/or PYTHONIOENCODING setting.

Of course we should help the reporter figure out what's going on and help them fix it with appropriate system configuration. If that doesn't work, then (and *only then*) we could think about doing a stupid thing.

Footnotes:

[1] Granted, "874" only has "windows-874" registered with the IANA, so it's kind of salient. Still, if numeric-only aliases were a "thing", surely we'd have heard about it by now. I first encountered Thai encodings in 1990 (ok, that was TIS 620, but windows-874 is basically TIS plus Microsoft punctuation extensions IIRC), and Thais do use computers in their native language a lot.

[2] Scare quotes to refer to appropriate platform facilities, as neither Windows nor Mac OS is strictly conformant to POSIX on this.
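To illustrate the PYTHONIOENCODING route mentioned above, here is a small sketch that starts a child interpreter with that variable set and reports which encoding the child picked for `sys.stdout`. The helper name `stdout_encoding_with` is hypothetical, and the exact spelling of the reported name may differ between Python versions.

```python
import os
import subprocess
import sys


def stdout_encoding_with(ioencoding):
    """Run a child interpreter with PYTHONIOENCODING set to `ioencoding`
    and return the encoding it reports for sys.stdout."""
    env = dict(os.environ, PYTHONIOENCODING=ioencoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the child fails to start
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(stdout_encoding_with("cp874"))
```

If the child cannot resolve the requested name, it exits with an error and `check=True` turns that into a `CalledProcessError`, which is the kind of startup failure the issue reports.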


Stephen J. Turnbull

7:50 a.m.


Ronald Oussoren writes:

...

The user shouldn’t have to do anything other than install Python. IMHO we're doing something wrong when the Python interpreter doesn’t start up with a default system configuration.

There's no evidence in the issue that I can see that suggests that the user installed Python into the default system configuration. I see a bunch of Python developers who have no access to the OP's system configuration demonstrating that something that shouldn't work and never has worked doesn't work, then providing a patch to make it work. This despite the fact that the OP hasn't provided any configuration details that would confirm this is a system default setting.

I wouldn't object to making it work if there were any evidence that it is a real problem that other users will encounter. But there isn't any such evidence yet, it's a non-standard alias according to Microsoft's own IANA registration, and Steven D'Aprano's argument that such aliases may be ambiguous is plausible, though I haven't seen confirmation it would be a problem in practice.

...

(when the user explicitly sets a bogus PYTHONIOENCODING or locale all bets are off,

I'm assuming that is the case, based on the fact that none of my two ;-) Thai students ever had this problem, nor have I seen a report of this problem for any encoding in either Emacs or Python contexts since about 1990, nor has the OP posted anything about his/her configuration.

...

although even then warning about and then ignoring bad settings would be more user-friendly than the current behavior)

If Python is told to talk YTREWQ and it doesn't know how to talk YTREWQ, ignoring the problem is not possible if any input or output in YTREWQ is required. The program will crash with a much harder to understand error message describing "undecodable input" in an encoding the user doesn't expect.

My own experience is that soldiering on is the least user-friendly thing to do, as typically there's a trivial change that the user can make to resolve the problem optimally. The obvious thing to do is to fall back to ASCII, which almost certainly is compatible with the terminal, the log files, and the user's eyes and brain, emit a warning, and quit. That is what we do. The warning seems OK: the OP also diagnosed the missing alias, likely with little trouble.

Steve
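The "warn in ASCII and quit" policy described above can be sketched in a few lines. This is an illustration only, not the interpreter's actual startup code; the function name `require_encoding` and the exit status 1 are assumptions made for the example.

```python
import codecs
import sys


def require_encoding(requested, stream=None):
    """Return the resolved codec name, or report in plain ASCII and exit."""
    try:
        return codecs.lookup(requested).name
    except LookupError:
        message = f"unknown encoding: {requested}"
        # Keep the diagnostic ASCII-only so any terminal or log can show it.
        safe = message.encode("ascii", "backslashreplace").decode("ascii")
        print(safe, file=stream if stream is not None else sys.stderr)
        # Quit rather than soldier on with an encoding nobody asked for.
        raise SystemExit(1) from None
```

Calling `require_encoding("YTREWQ")` prints the diagnostic to standard error and stops, so the user sees the bad setting named directly instead of a later "undecodable input" failure.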


