Provide UTF-8 version of Python for Windows.

Sorry for posting multiple threads so quickly.

Microsoft provides a UTF-8 code page for a process. It can be enabled by a manifest file.
https://docs.microsoft.com/ja-jp/windows/uwp/design/globalizing/use-utf8-cod...

How about providing Python binaries in both a "UTF-8 version" and an "ANSI version"? This idea can provide a smoother transition of the default encoding.

1. Provide UTF-8 version since Python 3.10
2. (Some years later) Recommend UTF-8 version
3. (Some years later) Provide only UTF-8 version
4. (Some years later, maybe) Change the default encoding

The upsides of this idea are:

* We don't need to emit a warning for `open(filename)`.
* We can see the download stats.

Especially, the last point is a huge advantage compared to the current UTF-8 mode (e.g. PYTHONUTF8=1). We can know how many users need the legacy behavior in new Python versions. That is very important information for us.

Of course, there are some downsides:

* Windows team needs to maintain more versions.
* More divisions for the "Python on Windows" environment.

Regards,
-- Inada Naoki <songofacandy@gmail.com>

Looks like that's available only for Microsoft Store apps, so it might not be viable for Python.

As I understand it, "Fusion manifest for an unpackaged Win32 app" (*) works for non-Store apps too. (*) https://docs.microsoft.com/ja-jp/windows/uwp/design/globalizing/use-utf8-cod... -- Inada Naoki <songofacandy@gmail.com>

On Mon, Jan 25, 2021, at 22:49, William Pickard wrote:
Looks like that's available only for Microsoft Store apps, so it might not be viable for Python.
I think the "Fusion manifest for an unpackaged Win32 app" part applies to non-store apps. [English version of the page: https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-cod... ]

Aren't there too many different Windows installers already? I worry that it's too hard to choose which one to use (I know I had to ask another expert :-). On Mon, Jan 25, 2021 at 7:05 PM Inada Naoki <songofacandy@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On 1/25/21, Inada Naoki <songofacandy@gmail.com> wrote:
I experimented with this manifest setting several months ago. To try it out, simply export the manifest from "python.exe", edit it to add the "activeCodePage" setting, and then replace it in "python.exe". The process active code page for GetACP() and GetOEMCP() is changed to UTF-8 (65001). The C runtime also overrides the user locale to UTF-8 if GetACP() returns UTF-8, i.e. setlocale(LC_CTYPE, "") will return "utf8" as the encoding. The console is hosted in a separate conhost.exe or openconsole.exe process, so it still defaults to the system OEM code page for its input and output code pages. This pertains only to low-level os.read() and os.write(). High-level console I/O uses io._WindowsConsoleIO for console files, which is internally UTF-16 and outwardly UTF-8.
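To see what the manifest change does to a given interpreter, one can print the values mentioned above. The following is a minimal sketch, not part of the original experiment: the helper names `active_code_pages` and `encoding_report` are made up for illustration, and the Windows branch calls GetACP() and GetOEMCP() through ctypes. On other platforms the code page entry is simply None.

```python
import locale
import sys


def active_code_pages():
    """Return (ANSI, OEM) code page numbers on Windows, or None elsewhere."""
    if sys.platform != "win32":
        return None
    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # GetACP() and GetOEMCP() take no arguments and return a UINT.
    kernel32.GetACP.restype = ctypes.c_uint
    kernel32.GetOEMCP.restype = ctypes.c_uint
    return kernel32.GetACP(), kernel32.GetOEMCP()


def encoding_report():
    """Collect the encoding-related settings of the running interpreter."""
    return {
        "code_pages": active_code_pages(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

With the "activeCodePage" manifest setting in place, the expectation from the description above is that both code pages report 65001; without it they report the system ANSI and OEM code pages.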
* Windows team needs to maintain more versions.
I suppose the installer could install both sets of binaries, and copy to "python[w][_d].exe" based on an installer option. But then the UTF-8 selection statistics wouldn't be tracked, unless the installer phones home.

On Tue, Jan 26, 2021 at 4:01 PM Eryk Sun <eryksun@gmail.com> wrote:
Can pip send `locale.getpreferredencoding(False)` to PyPI? If so, we can set the `PYTHONUTF8` environment variable from the installer too. Or we can provide a small tool to set/unset the `PYTHONUTF8` environment variable. -- Inada Naoki <songofacandy@gmail.com>

On 1/26/21, Eryk Sun <eryksun@gmail.com> wrote:
One concern is what to do for the special "ansi" and "oem" encodings. If scripts rely on them for IPC, such as with subprocess.Popen(), then it could be frustrating if they're just synonyms for UTF-8 (code page 65001). I've tested that it's possible for Python to peg "ansi" and "oem" to the system ANSI and OEM code pages via GetLocaleInfoEx() with LOCALE_NAME_SYSTEM_DEFAULT and the LCType constants LOCALE_IDEFAULTANSICODEPAGE and LOCALE_IDEFAULTCODEPAGE (OEM). But then they're no longer accurate within the current process, for which ANSI and OEM are UTF-8.
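The GetLocaleInfoEx() query described above might look roughly like the sketch below. It is an illustration written for this thread and has not been checked against the code that was actually tested: the function name `system_code_page` is hypothetical, the constant values are taken from the Windows SDK headers as I recall them, and the Windows branch is untested here. Off Windows it returns None.

```python
import sys

# Values as defined in the Windows SDK headers (winnls.h).
LOCALE_NAME_SYSTEM_DEFAULT = "!x-sys-default-locale"
LOCALE_IDEFAULTCODEPAGE = 0x000B      # system OEM code page
LOCALE_IDEFAULTANSICODEPAGE = 0x1004  # system ANSI code page


def system_code_page(lctype):
    """Return the system-wide code page as e.g. 'cp1252', or None off Windows.

    Unlike GetACP()/GetOEMCP(), this asks for the system default locale,
    so a per-process "activeCodePage" override should not affect it.
    """
    if sys.platform != "win32":
        return None
    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetLocaleInfoEx.argtypes = [
        ctypes.c_wchar_p,  # lpLocaleName
        ctypes.c_uint32,   # LCType
        ctypes.c_wchar_p,  # lpLCData (output buffer)
        ctypes.c_int,      # cchData (buffer size in characters)
    ]
    kernel32.GetLocaleInfoEx.restype = ctypes.c_int

    buf = ctypes.create_unicode_buffer(16)
    written = kernel32.GetLocaleInfoEx(
        LOCALE_NAME_SYSTEM_DEFAULT, lctype, buf, len(buf)
    )
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    # The code page comes back as a string of decimal digits.
    return "cp" + buf.value


if __name__ == "__main__":
    print("system ANSI:", system_code_page(LOCALE_IDEFAULTANSICODEPAGE))
    print("system OEM: ", system_code_page(LOCALE_IDEFAULTCODEPAGE))
```

A script that talks to legacy child processes could pass the returned name as the `encoding` argument of subprocess.Popen() instead of relying on "ansi" or "oem".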

On Tue, Jan 26, 2021 at 4:36 PM Eryk Sun <eryksun@gmail.com> wrote:
You are right. That's why I didn't change the default encoding of subprocess in PEP 597. The UTF-8 version of Python should change only the default text encoding, so it shouldn't use the UTF-8 code page.

The current UTF-8 mode has the same problem: it affects the PIPE encoding too. But we can change its behavior on Windows to:

* The default encoding of TextIOWrapper and most wrappers (e.g. open(), Path.open(), Path.read_text(), gzip.open(), ...) becomes "utf-8".
* locale.getpreferredencoding(False) returns the code page encoding (e.g. "cp932").
* The subprocess module uses `locale.getpreferredencoding(False)` for the default PIPE encoding.

And we can provide two versions of Python for Windows:

* "Python (UTF-8 version)" will enable the UTF-8 mode by default.
* "Python (ANSI version)" will disable the UTF-8 mode by default.

Users can override the default with the `-Xutf8` option and the `PYTHONUTF8` environment variable.

Does this idea make sense?
-- Inada Naoki <songofacandy@gmail.com>
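Code that wants the proposed split today, regardless of how the defaults end up, can spell both encodings explicitly. This is a minimal sketch under that assumption; the helper names `write_and_read_utf8` and `run_child_with_locale_encoding` are invented for the example and are not part of any proposal.

```python
import locale
import os
import subprocess
import sys
import tempfile


def write_and_read_utf8(text):
    """Round-trip text through a file, always using UTF-8 for file I/O."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, encoding="utf-8") as f:
            return f.read()


def run_child_with_locale_encoding(code):
    """Run a Python child and decode its stdout with the locale encoding."""
    # This is the value the proposal would keep for PIPE decoding.
    pipe_encoding = locale.getpreferredencoding(False)
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        encoding=pipe_encoding,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(write_and_read_utf8("UTF-8 text file content"))
    print(run_child_with_locale_encoding("print('hello from child')"))
```

Note that in the current UTF-8 mode `locale.getpreferredencoding(False)` itself returns UTF-8, which is the behavior the message above proposes to change on Windows.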

On 26.01.2021 09:24, Inada Naoki wrote:
Just a word of warning: ANSI version in the Windows world usually means "this application doesn't support Unicode", so you'd probably not want to use this term.

Overall, I think the approach with two different binaries is not going to work well. Users will get confused and many problems will arise due to users installing the wrong version for the apps they use. We already let them choose between 64-bit and 32-bit and embedded vs. installer. Some may understand the consequences of installing a 32-bit version on a 64-bit OS, but I suppose most don't know what "embedded" is for (including myself :-)). If you add UTF-8 vs. Locale dimensions on top, you'd create even more confusion.

I think it would be better to have the Windows installers get an option to set the PYTHONUTF8 env var for the user or system-wide. This would be off initially and default to on a few years later.

Note: Such a setting would also affect other Python versions on the system, so the user should take care before enabling it. Alternatively, a new env var could be used, which older Python versions don't know anything about, e.g. PYTHONWINUTF8. Python would then treat this as an alias for PYTHONUTF8.

-- Marc-Andre Lemburg, eGenix.com (#1, Jan 26 2021)
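The alias idea can be sketched as a small lookup. This is only an illustration: PYTHONUTF8 is the existing variable, while PYTHONWINUTF8 is the hypothetical name floated in the message above and is not recognized by any released Python. The function name `utf8_mode_requested` and the choice to let PYTHONUTF8 take precedence are assumptions made for the example.

```python
import os


def utf8_mode_requested(environ=None):
    """Return True/False if the environment asks for UTF-8 mode, else None.

    PYTHONUTF8 is checked first; PYTHONWINUTF8 (hypothetical, from the
    thread) is only consulted when PYTHONUTF8 gives no answer.
    """
    if environ is None:
        environ = os.environ
    for name in ("PYTHONUTF8", "PYTHONWINUTF8"):
        value = environ.get(name)
        if value == "1":
            return True
        if value == "0":
            return False
    return None


if __name__ == "__main__":
    print("UTF-8 mode requested by environment:", utf8_mode_requested())
```

Older interpreters would ignore the new name entirely, which is the point of the alternative: an installer could set it without changing the behavior of Python versions already on the system.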

On Tue, Jan 26, 2021 at 5:53 PM M.-A. Lemburg <mal@egenix.com> wrote:
OK. Confusing users only for getting stats is a bad idea...
I like the idea, and it is almost the same as what I proposed in PEP 597 (2nd) last year. https://discuss.python.org/t/pep-597-enable-utf-8-mode-by-default-on-windows... The difference from PEP 597 (2nd) is that I now propose not to change the subprocess.PIPE encoding in UTF-8 mode. I need to reconsider the stdin/stdout encoding when they are redirected. Maybe we can use GetConsoleCP() for the stdin encoding, and GetConsoleOutputCP() for the stdout encoding. I will write another proposal for it. -- Inada Naoki <songofacandy@gmail.com>
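The console code page idea could be prototyped along these lines. It is a rough sketch, not the follow-up proposal: the helper names `console_code_pages` and `redirected_stream_encodings` and the "utf-8" fallback are assumptions for illustration. Both Windows calls return 0 on failure, for example when the process has no console attached.

```python
import sys


def console_code_pages():
    """Return (input_cp, output_cp) of the attached console, or None off Windows.

    Each entry is None when the corresponding call fails (returns 0).
    """
    if sys.platform != "win32":
        return None
    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetConsoleCP.restype = ctypes.c_uint
    kernel32.GetConsoleOutputCP.restype = ctypes.c_uint
    input_cp = kernel32.GetConsoleCP()
    output_cp = kernel32.GetConsoleOutputCP()
    return (input_cp or None, output_cp or None)


def redirected_stream_encodings(fallback="utf-8"):
    """Pick (stdin, stdout) encodings from the console code pages."""
    pages = console_code_pages()
    if pages is None:
        return fallback, fallback
    input_cp, output_cp = pages
    stdin_encoding = f"cp{input_cp}" if input_cp else fallback
    stdout_encoding = f"cp{output_cp}" if output_cp else fallback
    return stdin_encoding, stdout_encoding


if __name__ == "__main__":
    print(redirected_stream_encodings())
```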

participants (6)
* Eryk Sun
* Guido van Rossum
* Inada Naoki
* M.-A. Lemburg
* Random832
* William Pickard