Re: [Python-ideas] [issue33865] [EASY] Missing code page aliases: "unknown encoding: 874"

It is easy to test it. Encoding/decoding with '874' should give the same result as with 'cp874'.
I know it is too late to remove that feature, but why do we support digit-only IDs for encodings? They can be ambiguous. If Wikipedia is correct, cp874 (also known as ibm874) and Windows-874 (also known as cp1162) are different: https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_874 https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_1162 -- Steve
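The test suggested above could be sketched roughly as follows. This is only an illustration: the helper name `same_mapping` is hypothetical, it compares decoding only, and it returns `None` when a name is not known to the running interpreter (which may be the case for "874", depending on the Python version).

```python
import codecs


def same_mapping(alias, canonical):
    """Compare how two codec names decode every single byte value.

    Returns True or False when both names resolve, and None when either
    name is unknown to this interpreter.
    """
    try:
        codecs.lookup(alias)
        codecs.lookup(canonical)
    except LookupError:
        return None
    data = bytes(range(256))
    # errors="replace" maps undefined bytes to U+FFFD instead of raising.
    return (data.decode(alias, errors="replace")
            == data.decode(canonical, errors="replace"))


print(same_mapping("874", "cp874"))
```

A result of `None` here means the alias is simply missing, which is the situation reported in the issue.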

Folks. There are standards. "1252" *is not* an alias for "windows-1252" according to the IANA, while "866" *is* an alias for "IBM866" according to the same authority. Most 3-digit "IBMxxx" ARE aliased to both "cpxxx" and just "xxx", but not all. None of "IBM874", "874", or "cp874" exists according to the IANA. https://www.iana.org/assignments/character-sets/character-sets.xhtml For the reasons Steven gave, I would say omit the digits-only aliases, but if we must use them because "there's a standard" (or backward compatibility), we should stick to those defined by standard, and only those. If we're following other standards that I'm unaware of, fine, but let's cite them rather than randomly introduce a plethora of aliases because they "look like" an existing (and unfortunate) standard. There's also some other weirdness with "windows-874", see below. We (somebody) should check other "windows-xxx" character sets to make sure they're not misnamed "cpxxx". Steven D'Aprano writes:
According to the IANA, they're not necessarily ambiguous. Here is the entry for IBM866:

    Name    MIBenum  Source                      Aliases   Reference
    IBM866  2086     IBM NLDG Volume 2           cp866     [Rick_Pond]
                     (SE09-8002-03) August 1994  866
                                                 csIBM866

where the entries in column 4 show the registered aliases. There are at least a dozen IBMxxx character sets with 'xxx' aliases. I don't understand what's with "cp874", though. We can surely take that one back, although we'd better hurry if it's in 3.7rc. We might want to add "windows-874" (which doesn't seem to be present in Python 3.6), since that's the standard character set name per IANA. The confusion between cp874 and windows-874 may be because in VENDORS/MICSFT/WINDOWS it's in CP874.TXT (as are all the code pages there).
https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_874
https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_1162
I don't know where Wikipedia's information comes from, but it's not the IANA. -- Associate Professor Division of Policy and Planning Science http://turnbull.sk.tsukuba.ac.jp/ Faculty of Systems and Information Email: turnbull@sk.tsukuba.ac.jp University of Tsukuba Tel: 029-853-5175 Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
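To see which digits-only aliases Python already ships, one could inspect the alias table in the standard library. This is a small sketch; the helper name `digit_only_aliases` is made up for illustration, and the output depends on the Python version in use.

```python
from encodings.aliases import aliases


def digit_only_aliases():
    """Return the digits-only entries of Python's codec alias table."""
    return {alias: target
            for alias, target in sorted(aliases.items())
            if alias.isdigit()}


for alias, target in digit_only_aliases().items():
    print(f"{alias:>6} -> {target}")
```

Comparing this list against the IANA registry would show which of the existing aliases are backed by a registration and which are not.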

Sure, but for at least one user Python 3.6 fails to start because initialising the sys.std* streams fails due to not finding a “874” encoding. The user sadly enough didn’t provide more information on his machine, other than that it is running some version of Windows. BTW, “cp874” does exist according to the Unicode Consortium: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP874.TXT, and appears to be a code page for a (the?) Thai language. The user might therefore be running Windows with a Thai locale. Ronald
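A diagnostic along the following lines might help collect the missing details from an affected machine. It is a sketch only: `encoding_report` is a hypothetical helper, and on a system where the interpreter cannot start at all it would have to be run with some workaround in place.

```python
import codecs
import locale
import sys


def encoding_report():
    """Map each encoding source to (reported name, resolved codec name)."""
    names = {
        "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
        "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
        "sys.getfilesystemencoding()": sys.getfilesystemencoding(),
    }
    report = {}
    for source, name in names.items():
        try:
            resolved = codecs.lookup(name).name if name else None
        except LookupError:
            # The platform reported a name Python has no codec for.
            resolved = None
        report[source] = (name, resolved)
    return report


for source, (name, resolved) in encoding_report().items():
    print(f"{source}: {name!r} -> {resolved!r}")
```

A reported name with no resolved codec would point at the setting that needs either a fix or an alias.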

That doesn't mean that the bug is best fixed by adding an alias. If the error was failing to find encoding "ltain-1", would we add an alias or fix the spelling? If 874 is not an official alias, we should consider it a misspelling and fix the misspelling, not add an alias. But either way, the point Stephen is making is that even if 874 is a legitimate alias, that shouldn't give us carte blanche to add numeric aliases for every encoding.

I agree, I’ve mentioned in the issue that I’d like to understand why Python looks for an encoding with this name.
That depends: if a major platform ships with locales where the encoding is misspelled, we have little choice but to add an alias. To put it too bluntly: standards are fine until they conflict with reality.
Possibly just for the “cp…” encodings, but IMHO only if we confirm that the code to look for the preferred encoding returns a codepage number on Windows and changing that code leads to worse results than adding numeric aliases for the “cp…” encodings. Ronald

Ronald Oussoren writes:
Almost all of the CPxxx encodings have multiple aliases[1], so I just don't see the point unless numeric-only code page designations are baked in to default "locales"[2] in official releases by major OS vendors. And probably not even then, since it should be easy enough to provide a proper "locale" and/or PYTHONIOENCODING setting. Of course we should help the reporter figure out what's going on and help them fix it with appropriate system configuration. If that doesn't work, then (and *only then*) we could think about doing a stupid thing.

Footnotes:

[1] Granted, "874" only has "windows-874" registered with the IANA, so it's kind of salient. Still, if numeric-only aliases were a "thing", surely we'd have heard about it by now. I first encountered Thai encodings in 1990 (ok, that was TIS 620, but windows-874 is basically TIS plus Microsoft punctuation extensions IIRC), and Thais do use computers in their native language a lot.

[2] Scare quotes to refer to appropriate platform facilities, as neither Windows nor Mac OS is strictly conformant to POSIX on this.
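As a rough illustration of the PYTHONIOENCODING workaround mentioned above, the sketch below starts a child interpreter with that variable set and reports which encoding its stdout ends up with. The helper name `child_stdout_encoding` is invented for this example, and "cp874" is just the value relevant to this thread.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child Python with PYTHONIOENCODING set; return its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        check=True,
    )
    # The printed encoding name is plain ASCII in any of these code pages.
    return result.stdout.decode("ascii").strip()


print(child_stdout_encoding("cp874"))
```

On Windows the same effect would normally be achieved by setting the variable in the user's environment before launching Python.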

Ronald Oussoren writes:
There's no evidence in the issue that I can see that suggests that the user installed Python into the default system configuration. I see a bunch of Python developers who have no access to the OP's system configuration demonstrating that something that shouldn't work and never has worked doesn't work, then providing a patch to make it work. This despite the fact that the OP hasn't provided any configuration details that would confirm this is a system default setting. I wouldn't object to making it work if there were any evidence that it is a real problem that other users will encounter. But there isn't any such evidence yet, it's a non-standard alias according to Microsoft's own IANA registration, and Steven D'Aprano's argument that such aliases may be ambiguous is plausible, though I haven't seen confirmation it would be a problem in practice.
(when the user explicitly sets a bogus PYTHONIOENCODING or locale all bets are off,
I'm assuming that is the case, based on the fact that none of my two ;-) Thai students ever had this problem, nor have I seen a report of this problem for any encoding in either Emacs or Python contexts since about 1990, nor has the OP posted anything about his/her configuration.
although even then warning about and then ignoring bad settings would be more user-friendly than the current behavior)
If Python is told to talk YTREWQ and it doesn't know how to talk YTREWQ, ignoring the problem is not possible if any input or output in YTREWQ is required. The program will crash with a much harder to understand error message describing "undecodable input" in an encoding the user doesn't expect. My own experience is that soldiering on is the least user-friendly thing to do, as typically there's a trivial change that the user can make to resolve the problem optimally. The obvious thing to do is to fall back to ASCII, which almost certainly is compatible with the terminal, the log files, and the user's eyes and brain, emit a warning, and quit. That is what we do. The warning seems OK: the OP also diagnosed the missing alias, likely with little trouble. Steve
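The "warn and fall back to ASCII" idea can be sketched in user-level code as below. This is not CPython's startup logic: `resolve_or_fallback` is a hypothetical helper, and unlike the behavior described above it keeps running instead of quitting after the warning.

```python
import codecs
import warnings


def resolve_or_fallback(name, fallback="ascii"):
    """Return the canonical codec name, or warn and return the fallback."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        warnings.warn(
            f"unknown encoding {name!r}; falling back to {fallback!r}",
            RuntimeWarning,
            stacklevel=2,
        )
        return fallback


print(resolve_or_fallback("YTREWQ"))
```

The warning names the offending setting, which is the piece of information a user needs in order to fix the configuration.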

This page <https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).a...> also lists 874 along with windows-874 as the .NET name for the Thai language and doesn't mention cp-874. I don't have knowledge of .NET but just wanted to add this as a reference.

Another disadvantage of patching the search function (or adding an alias for any digit-only encoding on the assumption that it means cpXXXX) is that it prepends "cp", so it only does the right thing when aliases.py, which takes precedence, has no entry for the name. Since some of the digit-only encodings, like '936' which corresponds to 'gbk', are listed in aliases.py, they don't get resolved as 'cp936' for now. But if new digit-only, non-cp encodings are added in the future, they will have to be added to that file so that precedence works, instead of always resolving to a cpXXXX encoding. I think this is noted at https://bugs.python.org/issue33865#msg319617.

It would be nice if the original poster provided some more context or an environment to reproduce it, beyond the screenshot, which has limited information. I am setting the search_function.patch aside and look forward to the OP replying in the issue.

Thanks

PS: This is my first mailing list post. Kindly ignore if I am using the wrong quoting mechanism.

On Monday, June 18, 2018 at 12:01:01 AM UTC+5:30, Ronald Oussoren wrote:
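The precedence concern can be illustrated with a user-registered search function. This is a sketch and not the patch from the issue: `digits_to_cp` is a hypothetical name, and the example relies on the standard library's own search function (which consults aliases.py) being tried before any function registered later.

```python
import codecs


def digits_to_cp(name):
    """Hypothetical fallback: treat a digits-only name NNN as cpNNN."""
    if not name.isdigit():
        return None
    try:
        return codecs.lookup("cp" + name)
    except LookupError:
        return None


# Registered after the built-in search function, so aliases.py wins first.
codecs.register(digits_to_cp)

print(codecs.lookup("936").name)  # resolved through aliases.py
print(codecs.lookup("874").name)  # alias table or the fallback above
```

Any digits-only name that is not a "cp" code page would need its own entry in aliases.py to avoid being caught by such a fallback.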

participants (4)
- Karthikeyan
- Ronald Oussoren
- Stephen J. Turnbull
- Steven D'Aprano