
Hi, all. I believe UTF-8 should be chosen by default for text encoding.

* The default encoding for Python source files is UTF-8.
* VS Code and even Notepad use UTF-8 by default.
* Text files downloaded from the Internet are probably UTF-8.
* UTF-8 is used when you use WSL, regardless of your system code page.
* Windows 10 (1903) adds a per-process option to change the active code page to UTF-8, and calls the system code page "legacy". [1]

But it is difficult to change the default text encoding of Python for backward compatibility reasons. So I want to recommend the UTF-8 mode:

* The default text encoding becomes UTF-8.
* When you need the legacy ANSI code page, you can use the "mbcs" codec.
* You can disable it when you need to run a Python application that relies on the legacy system encoding.

But it is not well known yet. And setting the environment variable is a bit difficult for people who are learning programming with Python. So I want to propose this:

1. Recommend it in the official document "Using Python on Windows" [2].
2. Show the UTF-8 mode status in the interactive interpreter header [3] on Windows.
3. Show a link to the UTF-8 mode documentation in that header too.
4. Add a checkbox in the installer to set the "PYTHONUTF8=1" environment variable.

What do you think?

If setting the "PYTHONUTF8=1" environment variable is too dangerous to recommend widely, we may be able to add a per-installation (and per-venv, if needed) option file (site.cfg in the same directory as python.exe) to enable UTF-8 mode. But that would make the Python startup process more complex...

Regards,

[1]: https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-cod...
[2]: https://docs.python.org/3/using/windows.html
[3]: Currently, the Python version and "Type "help",..." are printed.

-- 
Inada Naoki <songofacandy@gmail.com>
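As a rough illustration of what UTF-8 mode changes (a sketch; the exact preferred encoding you see depends on your platform and settings), you can inspect the mode and sidestep the default entirely by passing an explicit encoding:

```python
import locale
import sys
import tempfile

# sys.flags.utf8_mode is 1 when UTF-8 mode is active (PYTHONUTF8=1 or -X utf8).
print("UTF-8 mode:", sys.flags.utf8_mode)

# This is the encoding open() uses when none is given; under UTF-8 mode it is
# "utf-8" rather than the legacy ANSI code page.
print("Preferred encoding:", locale.getpreferredencoding(False))

# Passing encoding= explicitly behaves identically in either mode.
with tempfile.NamedTemporaryFile("w+", encoding="utf-8", delete=False) as f:
    f.write("héllo")
    path = f.name
with open(path, encoding="utf-8") as f:
    assert f.read() == "héllo"
```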

On Jan 10, 2020, at 03:45, Inada Naoki <songofacandy@gmail.com> wrote:
Hi, all.
I believe UTF-8 should be chosen by default for text encoding.
Correct me if I’m wrong, but I think in Python 3.7 on Windows 10, the filesystem encoding is already UTF-8, and the stdio console files are UTF-8 (but under the covers they actually wrap the native UTF-16 console APIs instead of using msvcrt stdio), so the only issue is the locale encoding, right?

Also, PYTHONUTF8 is only supported on Unix, so presumably it’s ignored if you set it on Windows, right? If so, you need to also add support for it, not just set it in the installer. And presumably you also want to add the equivalent command-line argument on Windows.

One last thing: On Linux, you often use the locale coercion feature instead of the assume-UTF-8 feature. (For example, if you’re running a subprocess and want to ensure its stdout is UTF-8…) Is there an equivalent issue for Windows, or a very different but equally important one that needs to be solved differently, or is there just nothing relevant here?
* Windows 10 (1903) adds per-process option to change active code page to UTF-8 and call the system code page "legacy".
If you do that, won’t Python 3.7 already use UTF-8 for the locale, because the active code page is what it sets the startup value to match?
If you’ve used the Windows 10 feature you mentioned above, won’t this just select the same UTF-8 you’re already using? Or are you suggesting that Python’s mbcs codec should also change to (on Windows when UTF8 is enabled) use “legacy” if it exists and only otherwise use actual “mbcs”? Or that nobody should use this Windows feature on Python 3.8+?

On Sat, Jan 11, 2020 at 2:30 AM Andrew Barnert <abarnert@yahoo.com> wrote:
You're right. It is used by default in many places. Some examples:

* Opening text files: open("README.md")
* Pipes in text mode: subprocess.check_output(["ls", "-l"], text=True)
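A small sketch of the pipe case (using sys.executable instead of ls so it runs anywhere): text=True decodes the child's output with the locale's preferred encoding, which is exactly the default UTF-8 mode changes, while an explicit encoding= pins it down in either mode.

```python
import subprocess
import sys

# text=True decodes with locale.getpreferredencoding(False) -- the very
# default that UTF-8 mode would change.
out = subprocess.check_output(
    [sys.executable, "-c", "print('spam')"], text=True)
assert out.strip() == "spam"

# Passing encoding= makes the pipe encoding explicit, independent of mode.
out = subprocess.check_output(
    [sys.executable, "-c", "print('spam')"], encoding="utf-8")
assert out.strip() == "spam"
```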
Also, PYTHONUTF8 is only supported on Unix, so presumably it’s ignored if you set it on Windows, right? If so, you need to also add support for it, not just set it in the installer.
PYTHONUTF8 is supported on Windows already. You can use "set PYTHONUTF8=1" to enable UTF-8 mode.
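A quick way to confirm this on any platform (a sketch): launch a child interpreter with PYTHONUTF8=1 in its environment and check sys.flags.utf8_mode.

```python
import os
import subprocess
import sys

# Enable UTF-8 mode in a child interpreter via the environment variable.
env = dict(os.environ, PYTHONUTF8="1")
out = subprocess.check_output(
    [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
    env=env, text=True)
assert out.strip() == "1"  # UTF-8 mode is on in the child
```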
One last thing: On Linux, you often use the locale coercion feature instead of the assume-UTF-8 feature. (For example, if you’re running a subprocess and want to ensure its stdout is UTF-8…) Is there an equivalent issue for Windows, or a very different but equally important one that needs to be solved differently, or is there just nothing relevant here?
On Windows, there is no way to ensure that a subprocess uses UTF-8:

* Some applications always use UTF-8.
* Some applications always use the legacy encoding.
* Some applications check GetConsoleOutputCP. (CLI only)
* Some applications have their own setting for stdout encoding. (e.g. PowerShell Core)
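For child processes that are themselves Python, there is at least a Python-specific lever: PYTHONIOENCODING forces the child's stdio encoding regardless of its locale (a sketch):

```python
import os
import subprocess
import sys

# Force the child interpreter's stdio to UTF-8 via PYTHONIOENCODING.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
out = subprocess.check_output(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env, text=True)
assert out.strip().lower().replace("-", "") == "utf8"
```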
I don't do that. And I don't think we should do this:

* It can be used only on Windows 10 1903 and later.
* Setting a manifest is harder than setting an environment variable. It is too difficult to opt in and out.
* It makes the "mbcs" encoding UTF-8 too, so there is no way to use the legacy encoding explicitly.

So I think UTF-8 mode is better than this Windows feature.

-- 
Inada Naoki <songofacandy@gmail.com>

On 1/10/20, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
The implementation of UTF-8 mode (i.e. -X utf8) is cross-platform, though I think it could use some tweaking for Windows.
Yes, 3.6+ on Windows defaults to UTF-8 for console I/O and the filesystem encoding. If for some reason you need the legacy behavior, it can be enabled via the following environment variables [1]: PYTHONLEGACYWINDOWSSTDIO and PYTHONLEGACYWINDOWSFSENCODING.

Setting PYTHONLEGACYWINDOWSFSENCODING switches the filesystem encoding to "mbcs". Note that this does not use the system MBS (multibyte string) API. Python simply transcodes between UTF-16 and ANSI instead of UTF-8. Currently this setting takes precedence over UTF-8 mode, but I think it should be the other way around.

Setting PYTHONLEGACYWINDOWSSTDIO uses the console input codepage for stdin and the console output codepage for stdout and stderr, but only if isatty is true and the process is attached to a console (see _Py_device_encoding in Python/fileutils.c). Otherwise it uses the system ANSI codepage. Note that this setting is currently **broken** in 3.8. In Python/initconfig.c, config_init_stdio_encoding calls config_get_locale_encoding to set config->stdio_encoding. This always uses the system ANSI codepage (e.g. 1252), even for console files for which this choice makes no sense.

Combining UTF-8 mode with legacy Windows standard I/O is generally dysfunctional. The result is mojibake, unless the console codepage happens to be UTF-8. I'd prefer UTF-8 mode to take precedence over legacy standard I/O mode and have it imply non-legacy I/O. In both of the above cases, what I'd prefer is for UTF-8 mode to take precedence over the legacy modes, i.e. to disable config->legacy_windows_fs_encoding and config->legacy_windows_stdio in the startup configuration.

Regarding the MBS API and UTF-8: in Windows 10, it's possible to set the ANSI and OEM codepages to UTF-8 at both the system level (in the system control panel) and the application level (in the application manifest).
But many functions are still only available in the WCS (wide-character string) API, such as GetLocaleInfoEx, GetFileInformationByHandleEx, and SetFileInformationByHandle. I don't know whether Microsoft plans to implement MBS wrappers in these cases.

If the ANSI codepage is UTF-8, then the MBS file API (e.g. CreateFileA) is basically equivalent to Python's UTF-8 filesystem encoding. There's one exception. Python uses the "surrogatepass" error handler, which allows invalid surrogate codes (i.e. a "Wobbly" WTF-8 encoding). In contrast, the MBS API translates invalid surrogates to the replacement character (U+FFFD). I think Python's choice is more sensible because the WCS file API (e.g. CreateFileW) and filesystem drivers do not verify that strings are valid Unicode.

The console uses the system OEM codepage as its default I/O codepage. Setting OEM to UTF-8 (at the system level, not at the application level), or manually setting the codepage to UTF-8 via `chcp.com 65001`, is a potential problem because the console doesn't support reading non-ASCII UTF-8 strings via ReadFile or ReadConsoleA. Prior to Windows 10, it returns an empty string for this case, which looks like EOF. The new console in Windows 10 instead translates each non-ASCII character as a null byte (e.g. "SPĀM" -> "SP\x00M"), which is better but still pretty much useless for reading non-English input.

Python 3.6+ is for the most part immune to this. In the default configuration, it uses ReadConsoleW to read UTF-16 instead of relying on the input codepage. (Low-level os.read is not immune to the problem, however, because it is not integrated with the new console I/O implementation.)

[1] https://docs.python.org/3/using/cmdline.html#environment-variables
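The "surrogatepass" behavior described above can be seen directly in pure Python (a sketch):

```python
# A lone surrogate is not valid Unicode, so the strict UTF-8 codec rejects it...
s = "A\ud800B"
try:
    s.encode("utf-8")
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    pass

# ...but "surrogatepass" lets it round-trip, which is how Python preserves
# arbitrary UTF-16 filesystem names (the "Wobbly" WTF-8 style encoding).
data = s.encode("utf-8", "surrogatepass")
assert data.decode("utf-8", "surrogatepass") == s
```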

On Sun, Jan 12, 2020 at 9:32 PM Eryk Sun <eryksun@gmail.com> wrote:
UTF-8 mode shouldn't take precedence over the legacy FS encoding. Mercurial uses the legacy encoding for file paths; it calls sys._enablelegacywindowsfsencoding() on Windows. https://www.mercurial-scm.org/repo/hg/rev/8d5489b048b7

Since Mercurial almost always uses binary files, I think UTF-8 mode doesn't break Mercurial. But I'm not sure. (Note that Mercurial on Python 3 on Windows is still beta.)

Regards,

-- 
Inada Naoki <songofacandy@gmail.com>

What do you think?
At the least, I'm in favor of recommending UTF-8 mode in the documentation for Windows users. It seems like it would fit well under the "Configuring Python" section of the page ( https://docs.python.org/3/using/windows.html#configuring-python). I'm undecided on the others, as I don't know what (2) and (3) would specifically entail. As far as (4) goes:
Would you mind elaborating on this point? In particular, what specific dangers/risks might be associated with setting that specific env var during installation? IIRC, the installer already configures a few others by default (I don't recall their names).
But it may make Python startup process more complex...
I would definitely prefer to have a checkbox to configure "PYTHONUTF8" during installation rather than requiring it to be done manually, assuming it can be done safely and effectively across different systems. Not only for the sake of lowering complexity, but also because it takes less time and effort. That can add up significantly when you consider the volume of users. On Fri, Jan 10, 2020 at 6:47 AM Inada Naoki <songofacandy@gmail.com> wrote:

On Sat, Jan 11, 2020 at 11:03 AM Kyle Stanley <aeros167@gmail.com> wrote:
Current header is:

    Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.

I'm proposing adding one more line:

    UTF-8 mode is disabled. (See https://url.to/utf8mode)
If the Python installer sets the PYTHONUTF8=1 environment variable, it may affect applications using embeddable Python or py2exe. So it may break applications which assume the default text encoding is "mbcs".
But it may make Python startup process more complex...
I would definitely prefer to have a checkbox to configure "PYTHONUTF8" during installation rather than requiring it to be done manually, assuming it can be done safely and effectively across different systems. Not only for the sake of lowering complexity, but also because it takes less time and effort. That can add up significantly when you consider the volume of users.
Even if we add a per-installation config file, the installer can still write it.

-- 
Inada Naoki <songofacandy@gmail.com>

Inada Naoki wrote:
Current header is:
I'm proposing adding one more line:
UTF-8 mode is disabled. (See https://url.to/utf8mode)
Ah, that should be fine. I don't have any issues with (2) and (3) then. Inada Naoki wrote:
If the Python installer sets the PYTHONUTF8=1 environment variable, it may affect applications using embeddable Python or py2exe.
So it may break applications which assume the default text encoding is "mbcs".
In that case, I think we should hold off on (4) until it's tested against a decent variety of applications, or at the least consider leaving it off by default (unticked checkbox) with a note of some form that warns users of the possible legacy incompatibility with MBCS.

I'll admit, though, that I have no experience working with any application that implicitly assumes MBCS is the default encoding. Is this a common occurrence in some older applications? I could imagine it being the case in legacy applications that need international encoding, but I have zero idea of how common it is in reality.

As a side note, Microsoft recommends not using MBCS for encoding in newer applications in their Visual C++ docs: "MBCS is a legacy technology and is not recommended for new development." ( https://docs.microsoft.com/en-us/cpp/text/support-for-multibyte-character-se... ). This might also be useful for getting a general idea of MBCS expectations when it comes to Windows development, although some of it is specific to C++: https://docs.microsoft.com/en-us/cpp/text/general-mbcs-programming-advice?vi... .

On Sun, Jan 12, 2020 at 1:28 AM Inada Naoki <songofacandy@gmail.com> wrote:

Kyle Stanley writes:
At least in East Asia, it is. E.g., my university's student information system for faculty use allows downloading of class lists in CSV format. Until sometime last year it defaulted to Japan's annoying MBCS code page 932 ("Shift JIS"). It still allows Shift JIS, but at least it now defaults to UTF-8. I assume the former default means that the PHBs got fewer complaints from people whose software assumes UTF-8 than from those whose software assumes Shift JIS until recently. And this is one of the easy migrations, since almost everybody has used Excel exclusively for about 15 years.
As a side note, Microsoft recommends
Thank you for this information, but it's basically irrelevant. Encodings are no longer a technical problem (at least to my non-native eye Unicode applications are now more likely to display Japanese correctly and beautifully than the legacy apps), but rather a cultural and budgetary problem. There is still tons of data in legacy applications, both as text files and in various application data formats, that use legacy encodings (in Japanese, that means MBCS). Sadly, it's not as simple as running "iconv -f shift_jis -t utf-8" on all the .txt files in sight. That WFM (well, I had to do a few .tex and .rst files too ;-), but most people are dependent on Word, Excel, and other application formats, and it's a PITA; VB scripting is a very rare skill except among the (of course overburdened) technical staff. Steve
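The iconv step mentioned above has a direct Python equivalent for plain text (a sketch using an in-memory string; cp932 is Python's codec name for Windows code page 932):

```python
# What `iconv -f shift_jis -t utf-8` does, per string:
sjis_bytes = "スパム".encode("cp932")          # legacy Shift JIS bytes
utf8_bytes = sjis_bytes.decode("cp932").encode("utf-8")
assert utf8_bytes.decode("utf-8") == "スパム"
```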

On Jan 13, 2020, at 19:32, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
But those are binary formats, not something you can just read in as text in Python even if you know the encoding. And surely nobody is extracting the text out of those file formats manually anyway? Unless you’re actually working on a tool like wv or antiword, you just use one of those tools (or the Python wrappers around them), or talk to Word or Word Viewer over COM, and either way you just get Unicode. (And even if you are working on a tool like wv, while PITA is an understatement, the hard part is navigating the insane formats to find and order the text chunks; once you can do that, dealing with UCS2 vs. ANSI vs. ALTANSI chunks and knowing where to find the code pages for the latter two in the structure is the comparatively easy part.)

I can think of one place where mbcs would be useful and you can’t just iconv. IIRC, with old-school RTF export you have to unescape and then convert and then re-escape? But everything else I remember dealing with is either a binary format you need a library for, or plain text like TXT and CSV.

But even then, can you even rely on mbcs? I know it used to be a problem (early 00s) in many Japanese shops that you had Shift-JIS docs on Windows boxes or in Notes servers or whatever where the default codepage wasn’t Shift-JIS.

Andrew Barnert writes:
"My Name Is Nobody", aka The Exception to Prove the Rule. ;-)
True, but I'm really mostly talking about those "other application formats" such as Ichitaro, email clients I will not dirty the keyboard by typing their names, and more specialized stuff (stats packages, archivers, etc) whose names would not be familiar here.
But even then, can you even rely on mbcs?
No (we are discussing a nation where even today you may run into 5 different "native" encodings in the same day, after all), but on Windows what else were you going to use as the default, if not UTF-8?
Never saw one of those. If the box was running Windows, IME the default code page was 932 until the late noughties, when UTF-8 started to become common (although file systems were still mostly Shift JIS, to the enjoyment of all who weren't using Python 3 with PEP 393 ;-). If the box was running Unix, the encoding was usually packed EUC-JP, but I never dealt with Notes (banzai! the gods smiled on me). OTOH, I have seen filesystem paths on Sun boxen and in zipfiles with multiple encodings in them. :-)

Steve
