[Tutor] logging to cmd.exe

eryk sun eryksun at gmail.com
Tue Sep 26 10:38:33 EDT 2017


> cmd.exe can use cp65001 aka utf8???

CMD is a Unicode application that for the most part uses WinAPI
wide-character functions, including the console API functions (as does
Python 3.6+). There are a few exceptions. CMD uses the console
codepage when decoding batch files (line by line, so you can change
the codepage in the middle of a batch script), when writing output
from its internal commands (e.g. dir) to pipes and files (the /u
option overrides this), and when reading output from programs in a
`FOR /F` loop.

> Why does cmd.exe still use cp850?

In the above cases CMD uses the active console input or output
codepage, which defaults to the system locale's OEM codepage. If it's
not attached to a console (i.e. when run as a DETACHED_PROCESS), CMD
uses the ANSI codepage in these cases.
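
To see which codepages are in play on a given machine, something like the following sketch may help. It assumes the kernel32 functions GetConsoleCP, GetConsoleOutputCP, GetACP and GetOEMCP are called through ctypes; the helper name `console_codepages` is made up for illustration, and it simply returns None when not run on Windows.

```python
import sys

def console_codepages():
    """Return the active console and system codepages (Windows only).

    Hypothetical helper for illustration. Returns None on other platforms.
    """
    if sys.platform != 'win32':
        return None
    import ctypes
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    return {
        # Both console values are 0 if the process has no console attached.
        'console_input': kernel32.GetConsoleCP(),
        'console_output': kernel32.GetConsoleOutputCP(),
        # System locale defaults (legacy ANSI and OEM codepages).
        'ansi': kernel32.GetACP(),
        'oem': kernel32.GetOEMCP(),
    }

if __name__ == '__main__':
    print(console_codepages())
```

On a system that has not run chcp, the two console values would normally match the 'oem' value.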

Anyway, you appear to be talking about the Windows console, which
people often confuse with CMD. Programs that use command-line
interfaces (CLIs) and text user interfaces (TUIs), such as classic
system shells, are clients of a given console or terminal interface. A
TUI application typically is tightly integrated with the console or
terminal interface (e.g. a curses application), while a CLI
application typically just uses standard I/O (stdin, stdout, stderr).
Both cmd.exe and python.exe are Windows console clients. There's
nothing special about cmd.exe in this regard.

Now, there are a couple of significant problems with using codepage
65001 in the Windows console.

Prior to Windows 8, WriteFile and WriteConsoleA return the number of
decoded wide characters written to the console, which is a bug because
they're supposed to return the number of bytes written. It's not a
problem so long as there's a one-to-one mapping between bytes and
characters in the console's output codepage. But UTF-8 can have up to
4 bytes per character. This misleads buffered writers such as C FILE
streams and Python 3's io module, which in turn causes gibberish to be
printed after every write of a string that includes non-ASCII
characters.
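
The effect on a buffered writer can be imitated in pure Python. This is only a simulation under the assumption that the writer retries whatever tail it believes was not written; the function name `simulate_buffered_write` is invented for the example, and no real console API is called.

```python
def simulate_buffered_write(text):
    """Return what a simulated console shows when writes report characters.

    Models the pre-Windows 8 bug: the whole buffer is displayed, but the
    reported count is the number of decoded characters instead of bytes.
    """
    console = []                      # text that reaches the simulated console
    pending = text.encode('utf-8')    # bytes the writer still wants to write
    while pending:
        # The console receives and displays the entire buffer...
        shown = pending.decode('utf-8', errors='replace')
        console.append(shown)
        # ...but reports a character count where a byte count is expected.
        reported = len(shown)
        # The writer assumes the remaining bytes were not written and retries.
        pending = pending[reported:]
    return ''.join(console)

if __name__ == '__main__':
    print(repr(simulate_buffered_write('abcαβγdef\n')))
```

With 'abcαβγdef\n' (13 bytes, 10 characters) the simulated writer resends the last 3 bytes, so the result is 'abcαβγdef\nef\n'. A pure ASCII string comes through unchanged.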

Prior to Windows 10, with codepage 65001, reading input from the
console via ReadFile or ReadConsoleA fails if the input has
non-ASCII characters. It gets reported as a successful read of zero
bytes. This causes Python to think it's at EOF, so the REPL quits (as
if Ctrl+Z had been entered) and input() raises EOFError.
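
Why a zero-byte read looks like EOF can be shown with a small stand-in for a line reader. This is a rough sketch, not how Python's input() is actually implemented; `read_line` and `failing_console_read` are hypothetical names, and the failing read is faked rather than produced by a real console.

```python
def read_line(raw_read):
    """Read one line via raw_read(size), treating zero bytes as end of file."""
    data = raw_read(100)
    if not data:
        # A successful read of zero bytes is the conventional EOF signal.
        raise EOFError('zero-byte read treated as EOF')
    return data.decode('utf-8').rstrip('\r\n')

def failing_console_read(size):
    """Fake the behaviour described above: success, but zero bytes returned."""
    return b''

if __name__ == '__main__':
    print(read_line(lambda size: b'abc\r\n'))
    try:
        read_line(failing_console_read)
    except EOFError as exc:
        print('EOFError:', exc)
```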

Even in Windows 10, while the entire read doesn't fail, it's not much
better. It replaces non-ASCII characters with NUL bytes. For example,
in Windows 10.0.15063:

    >>> import os
    >>> os.read(0, 100)
    abcαβγdef
    b'abc\x00\x00\x00def\r\n'

Microsoft is gradually working on fixing UTF-8 support in the console
(well, two developers are working on it). They appear to have fixed it
at least for the private console APIs used by the new Linux subsystem
in Windows 10:

    Python 3.5.2 (default, Nov 17 2016, 17:05:23)
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> s = os.read(0, 100)
    abcαβγdef
    >>> s
    b'abc\xce\xb1\xce\xb2\xce\xb3def\n'
    >>> s.decode()
    'abcαβγdef\n'

Maybe it's fixed in the Windows API in an upcoming update. But still,
there are a lot of Windows 7 and 8 systems out there, for which
codepage 65001 in the console will remain broken.

> I always thought 65001 was not a 'real' codepage, even though some locales (e.g. Georgia) use it [1].

Codepage 65001 isn't used by any system locale as the legacy ANSI or
OEM codepage. The console allows it probably because no one thought to
prevent using it in the late 1990s. It has been buggy for two decades.

Moodle seems to have special support for using UTF-8 with Georgian.
But as far as Windows is concerned, there is no legacy codepage for
Georgian. For example:

    import ctypes
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    LD_ACP = LOCALE_IDEFAULTANSICODEPAGE = 0x00001004
    acp = (ctypes.c_wchar * 6)()

    >>> kernel32.GetLocaleInfoEx('ka-GE', LD_ACP, acp, 6)
    2
    >>> acp.value
    '0'

A value of zero here means no ANSI codepage is defined [1]:

    If no ANSI code page is available, only Unicode can be used for
    the locale. In this case, the value is CP_ACP (0). Such a locale
    cannot be set as the system locale. Applications that do not
    support Unicode do not work correctly with locales marked as
    "Unicode only".

Georgian (ka-GE) is a Unicode-only locale [2] that cannot be set as
the system locale.

[1]: https://msdn.microsoft.com/en-us/library/dd373761
[2]: https://msdn.microsoft.com/en-us/library/ms930130.aspx

