Re: [Python-ideas] Fix default encodings on Windows

10 Aug 2016

On Wed, Aug 10, 2016 at 6:10 PM, Steve Dower wrote:
...
> Similarly, locale.getpreferredencoding() on Windows returns a legacy value -
> the user's active code page - which should generally not be used for any
> reason. The one exception is as a default encoding for opening files when no
> other information is available (e.g. a Unicode BOM or explicit encoding
> argument). BOMs are very common on Windows, since the default assumption is
> nearly always a bad idea.
The CRT doesn't allow UTF-8 as a locale encoding because Windows
itself doesn't allow this. So locale.getpreferredencoding() can't
change, but in practice it can be ignored.
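
To make the fallback order concrete, here is a minimal sketch: an
explicit encoding wins, then a BOM, and only then the legacy value from
locale.getpreferredencoding(). The helper name guess_encoding and the
BOM table are my own illustration, not anything in CPython.

```python
import codecs
import locale

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM starts with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_encoding(head, explicit=None):
    """Pick an encoding for a file whose first bytes are `head` (hypothetical helper)."""
    if explicit:
        return explicit
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # Last resort only: on Windows this is the user's active code page.
    return locale.getpreferredencoding(False)
```

The codec names returned for BOM-marked data ("utf-8-sig", "utf-16",
"utf-32") all consume the BOM when decoding.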

Speaking of locale, Windows Python should call setlocale(LC_CTYPE, "")
in pylifecycle.c in order to work around an inconsistency between
LC_TIME and LC_CTYPE in the default "C" locale. The former is ANSI
while the latter is effectively Latin-1, which leads to mojibake in
time.tzname and elsewhere. Calling setlocale(LC_CTYPE, "") is already
done on most Unix systems, so this would actually improve
cross-platform consistency.
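
Until that happens, a script can make the equivalent call itself. This
is only a sketch of the Python-level counterpart to the proposed C call;
the wrapper name is made up, and it returns None when the environment
names a locale the C runtime does not support.

```python
import locale
import time


def sync_ctype_with_user_locale():
    """Switch LC_CTYPE from "C" to the user's default locale (hypothetical helper)."""
    try:
        return locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        return None


if __name__ == "__main__":
    print(sync_ctype_with_user_locale())
    # time.tzname is decoded by the interpreter, so whether already-cached
    # names change after this call depends on the Python version.
    print(time.tzname)
```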
...
> Finally, the encodings of stdin, stdout and stderr are currently (correctly)
> inferred from the encoding of the console window that Python is attached to.
> However, this is typically a codepage that is different from the system
> codepage (i.e. it's not mbcs) and is almost certainly not Unicode. If users
> are starting Python from a console, they can use "chcp 65001" first to
> switch to UTF-8, and then *most* functionality works (input() has some
> issues, but those can be fixed with a slight rewrite and possibly breaking
> readline hooks).
Using codepage 65001 for output is broken prior to Windows 8 because
WriteFile/WriteConsoleA returns (as an output parameter) the number of
decoded UTF-16 codepoints instead of the number of bytes written,
which makes a buffered writer repeatedly write garbage at the end of
each write in proportion to the number of non-ASCII characters. This
can be worked around by decoding to get the UTF-16 size before each
write, or by just blindly assuming that a console write always
succeeds in writing the entire buffer. In this case the console should
be detected by GetConsoleMode(). isatty() isn't right for this since
it's true for all character devices, which includes NUL among others.
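
As a rough sketch of both points, the helpers below map a count
reported in UTF-16 units back to UTF-8 bytes, and test for a real
console with GetConsoleMode via ctypes. The function names are mine,
and is_console() simply returns False off Windows.

```python
import os
import sys


def utf16_units(data):
    """Number of UTF-16 code units that valid UTF-8 `data` decodes to."""
    return len(data.decode("utf-8").encode("utf-16-le")) // 2


def bytes_consumed(data, reported):
    """Translate a count reported in UTF-16 units into UTF-8 bytes written."""
    if reported >= utf16_units(data):
        return len(data)
    # Partial write: re-encode only the first `reported` code units.
    wide = data.decode("utf-8").encode("utf-16-le")[: 2 * reported]
    return len(wide.decode("utf-16-le", errors="ignore").encode("utf-8"))


def is_console(fd):
    """True only if `fd` refers to a Windows console (not NUL or a pipe)."""
    if sys.platform != "win32":
        return False
    import ctypes
    import msvcrt
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetConsoleMode.argtypes = (wintypes.HANDLE, wintypes.LPDWORD)
    kernel32.GetConsoleMode.restype = wintypes.BOOL
    try:
        handle = msvcrt.get_osfhandle(fd)
    except OSError:
        return False
    mode = wintypes.DWORD()
    # GetConsoleMode fails for anything that is not a console handle.
    return bool(kernel32.GetConsoleMode(handle, ctypes.byref(mode)))
```

For example, "héllo" is 6 bytes of UTF-8 but 5 UTF-16 units, so a
reported count of 5 means the whole 6-byte buffer was written.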

Codepage 65001 is broken for non-ASCII input (via
ReadFile/ReadConsoleA) in all versions of Windows that I've tested,
including Windows 10. By attaching a debugger to conhost.exe you can
see how it fails in WideCharToMultiByte because it assumes one byte
per character. If you try to read 10 bytes, it assumes you're trying
to read 10 UTF-16 'characters' into a 10 byte buffer, which fails for
UTF-8 when even a single non-ASCII character is read. The
ReadFile/ReadConsoleA call returns that it successfully read 0 bytes,
which is interpreted as EOF. This cannot be worked around. The only
way to read the full range of Unicode from the console is via the
wide-character APIs ReadConsoleW and ReadConsoleInputW.
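
The arithmetic behind the failure can be shown with a toy model. This
is not the conhost code, only a simulation of the symptom described
above under the assumption that the conversion buffer is sized at one
byte per character; the function name is hypothetical.

```python
def simulated_console_read(pending, nbytes):
    """Toy model of ReadFile on a codepage 65001 console (not the real code)."""
    # At most `nbytes` characters are taken from the pending input line.
    chunk = pending[:nbytes]
    # Assumed sizing error: one byte of buffer per character.
    capacity = len(chunk)
    encoded = chunk.encode("utf-8")
    if len(encoded) > capacity:
        # The conversion fails and the caller sees 0 bytes, i.e. EOF.
        return b""
    return encoded
```

ASCII input fits exactly, so it round-trips. A single "ï" needs two
bytes in UTF-8, so the encoded text no longer fits and the read comes
back empty.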

IMO, Python needs a C implementation of the win_unicode_console
module, using the wide-character APIs ReadConsoleW and WriteConsoleW.
Note that this sets sys.std*.encoding as UTF-8 and transcodes, so
Python code never has to work directly with UTF-16 encoded text.
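
The transcoding step could look roughly like the sketch below. It is
not win_unicode_console's actual code; the class name is made up, and
the UTF-16-LE input stands in for what ReadConsoleW would hand back. An
incremental decoder is used so that a surrogate pair split across two
reads is held back until it is complete.

```python
import codecs


class Utf16ToUtf8:
    """Turn chunks of UTF-16-LE console input into UTF-8 bytes (illustrative)."""

    def __init__(self):
        self._decoder = codecs.getincrementaldecoder("utf-16-le")()

    def feed(self, wide, final=False):
        # Incomplete trailing input (odd byte, lone high surrogate) is
        # buffered by the decoder until the next call.
        return self._decoder.decode(wide, final).encode("utf-8")
```

Feeding the two halves of a non-BMP character separately yields nothing
for the first half and the full 4-byte UTF-8 sequence for the second.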

eryk sun