
On Wed, Aug 10, 2016 at 6:10 PM, Steve Dower steve.dower@python.org wrote:
Similarly, locale.getpreferredencoding() on Windows returns a legacy value - the user's active code page - which should generally not be used for any reason. The one exception is as a default encoding for opening files when no other information is available (e.g. a Unicode BOM or explicit encoding argument). BOMs are very common on Windows, since the default assumption is nearly always a bad idea.
The CRT doesn't allow UTF-8 as a locale encoding because Windows itself doesn't allow this. So locale.getpreferredencoding() can't change, but in practice it can be ignored.
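To make the "default of last resort" idea concrete, here is a minimal sketch of picking a file encoding: an explicit argument wins, then a BOM, and only then locale.getpreferredencoding(). The helper name guess_encoding and the BOM table are made up for illustration; this is not how open() behaves today.

```python
import codecs
import locale

# Longest BOMs first, so a UTF-32 LE BOM is not mistaken for UTF-16 LE.
# The codec names chosen here consume the BOM while decoding.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(head, explicit=None):
    """Pick an encoding for a file whose first bytes are `head`.

    Hypothetical helper: an explicit encoding wins, then a BOM, and the
    locale's preferred encoding (the ANSI code page on Windows) is used
    only when nothing else is known.
    """
    if explicit is not None:
        return explicit
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return locale.getpreferredencoding(False)
```

For example, guess_encoding(open(path, "rb").read(4)) could be passed as the encoding argument when reopening the file in text mode.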
Speaking of locale, Windows Python should call setlocale(LC_CTYPE, "") in pylifecycle.c in order to work around an inconsistency between LC_TIME and LC_CTYPE in the default "C" locale. The former is ANSI while the latter is effectively Latin-1, which leads to mojibake in time.tzname and elsewhere. Calling setlocale(LC_CTYPE, "") is already done on most Unix systems, so this would actually improve cross-platform consistency.
Finally, the encodings of stdin, stdout and stderr are currently (correctly) inferred from the encoding of the console window that Python is attached to. However, this is typically a codepage that is different from the system codepage (i.e. it's not mbcs) and is almost certainly not Unicode. If users are starting Python from a console, they can use "chcp 65001" first to switch to UTF-8, and then *most* functionality works (input() has some issues, but those can be fixed with a slight rewrite and possibly breaking readline hooks).
Using codepage 65001 for output is broken prior to Windows 8 because WriteFile/WriteConsoleA return (as an output parameter) the number of decoded UTF-16 code units instead of the number of bytes written. A buffered writer treats the shortfall as a partial write and retries the tail of the buffer, so it repeatedly writes garbage at the end of each write in proportion to the number of non-ASCII characters. This can be worked around by decoding to get the UTF-16 size before each write, or by just blindly assuming that a console write always succeeds in writing the entire buffer. In this case the console should be detected by GetConsoleMode(). isatty() isn't right for this since it's true for all character devices, which includes NUL among others.
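As a rough sketch of the first workaround: compute the UTF-16 size of the buffer up front, then map whatever count the console reports back to a byte count. The helper names (utf16_units, bytes_for_units, is_console) are hypothetical, and the sketch assumes each buffer holds whole, valid UTF-8 characters, which a real buffered writer would have to guarantee separately. The console check uses GetConsoleMode via ctypes and simply returns False off Windows.

```python
import sys


def utf16_units(data):
    """Number of UTF-16 code units needed for UTF-8 encoded `data`."""
    return len(data.decode("utf-8").encode("utf-16-le")) // 2


def bytes_for_units(data, units):
    """Map a count of UTF-16 code units back to a count of UTF-8 bytes.

    Assumes `data` is complete, valid UTF-8 (an assumption of this sketch).
    """
    used_units = 0
    used_bytes = 0
    for ch in data.decode("utf-8"):
        n = 2 if ord(ch) > 0xFFFF else 1  # non-BMP needs a surrogate pair
        if used_units + n > units:
            break
        used_units += n
        used_bytes += len(ch.encode("utf-8"))
    return used_bytes


def is_console(fd):
    """True if `fd` refers to a Windows console, per GetConsoleMode()."""
    if sys.platform != "win32":
        return False
    import ctypes
    import msvcrt
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetConsoleMode.argtypes = (wintypes.HANDLE, wintypes.LPDWORD)
    kernel32.GetConsoleMode.restype = wintypes.BOOL
    try:
        handle = msvcrt.get_osfhandle(fd)
    except OSError:
        return False
    mode = wintypes.DWORD()
    # Fails for NUL, pipes and files; succeeds only for console handles.
    return bool(kernel32.GetConsoleMode(handle, ctypes.byref(mode)))
```

With a 13-byte buffer such as "héllo€😀" in UTF-8, the buggy call would report 8; bytes_for_units(buf, 8) turns that back into 13, so the writer does not resend the last 5 bytes.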
Codepage 65001 is broken for non-ASCII input (via ReadFile/ReadConsoleA) in all versions of Windows that I've tested, including Windows 10. By attaching a debugger to conhost.exe you can see how it fails in WideCharToMultiByte because it assumes one byte per character. If you try to read 10 bytes, it assumes you're trying to read 10 UTF-16 'characters' into a 10 byte buffer, which fails for UTF-8 when even a single non-ASCII character is read. The ReadFile/ReadConsoleA call returns that it successfully read 0 bytes, which is interpreted as EOF. This cannot be worked around. The only way to read the full range of Unicode from the console is via the wide-character APIs ReadConsoleW and ReadConsoleInputW.
IMO, Python needs a C implementation of the win_unicode_console module, using the wide-character APIs ReadConsoleW and WriteConsoleW. Note that this sets sys.std*.encoding as UTF-8 and transcodes, so Python code never has to work directly with UTF-16 encoded text.
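The transcoding part could look roughly like the following sketch of a raw reader. It is not the win_unicode_console implementation: read_wide is a hypothetical callable standing in for a ReadConsoleW wrapper that returns already-decoded text (and an empty string at EOF), and lone surrogates are not handled.

```python
import io


class Utf8ConsoleReader(io.RawIOBase):
    """Raw stream that serves UTF-8 bytes from a wide-character source.

    `read_wide(nchars)` is a stand-in for a ReadConsoleW wrapper; it is an
    assumption of this sketch, not an existing API.
    """

    def __init__(self, read_wide):
        self._read_wide = read_wide
        self._pending = b""  # encoded bytes not yet handed to the caller

    def readable(self):
        return True

    def readinto(self, b):
        if not self._pending:
            # Ask for characters, not bytes; leftovers are kept in _pending
            # so a small buffer never splits the request at the source.
            text = self._read_wide(max(1, len(b)))
            self._pending = text.encode("utf-8")
        n = min(len(b), len(self._pending))
        b[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n  # 0 only when read_wide reported EOF
```

Wrapped as io.TextIOWrapper(io.BufferedReader(Utf8ConsoleReader(read_wide)), encoding="utf-8"), the stream reports UTF-8 as its encoding while the UTF-16 handling stays inside read_wide.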