[Tutor] Changing the interpreter prompt symbol from ">>>" to ???

eryk sun eryksun at gmail.com
Sun Mar 13 14:58:46 EDT 2016


On Sun, Mar 13, 2016 at 3:14 AM, Albert-Jan Roskam
<sjeik_appie at hotmail.com> wrote:
> I thought that utf-8 (cp65001) is by definition (or by design?) impossible
> for console output in windows? Aren't there "w" (wide) versions of functions
> that do accept utf-8?

The wide-character API works with the native Windows character
encoding, UTF-16. Except the console is a bit 'special'. A surrogate
pair (e.g. a non-BMP emoji) appears as 2 box characters, but you can
copy it from the console to a rich text application, and it renders
normally. The console also doesn't support variable-width fonts for
mixing narrow and wide (East Asian) glyphs on the same screen. If that
matters, there's a program called ConEmu that hides the console and
proxies its screen and input buffers to drive an improved interface
that has flexible font support, ANSI/VT100 terminal emulation, and
tabs. If you pair that with win_unicode_console, it's almost as good
as a Linux terminal, but you have to jump through too many hoops to
make it all work.
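
For illustration, here's a rough ctypes sketch of pushing a non-BMP
character through the wide API (the snake emoji U+1F40D is just an
arbitrary example). WriteConsoleW counts UTF-16 code units, so the
surrogate pair counts as 2:

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    hout = kernel32.GetStdHandle(-11)  # STD_OUTPUT_HANDLE

    # U+1F40D is outside the BMP, so it's a surrogate pair in UTF-16.
    text = '\U0001F40D\n'
    nchars = len(text.encode('utf-16-le')) // 2  # UTF-16 code units
    written = wintypes.DWORD(0)
    kernel32.WriteConsoleW(hout, text, nchars,
                           ctypes.byref(written), None)
    # In the legacy console the emoji shows up as 2 boxes, but copying
    # it out pastes the real character.
    print(written.value)  # 3 code units: surrogate pair + newline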

Some people try to use UTF-8 (codepage 65001) in the ANSI API --
ReadConsoleA/ReadFile and WriteConsoleA/WriteFile. But the console's
UTF-8 support is dysfunctional. It's not designed to handle it.
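
(If anyone wants to experiment, switching the console to UTF-8 is
roughly the following, i.e. what "chcp 65001" does. The snippets
below assume the console is in this state.)

    import ctypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.SetConsoleCP(65001)        # input codepage
    kernel32.SetConsoleOutputCP(65001)  # output codepage
    print(kernel32.GetConsoleCP(), kernel32.GetConsoleOutputCP())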

In Windows 7, WriteFile calls WriteConsoleA, which decodes the buffer
to UTF-16 using the current codepage and returns the number of UTF-16
'characters' written instead of the number of bytes. This confuses
buffered writers. Say it writes a 20-byte UTF-8 string with 2 bytes
per character. WriteFile reports that it successfully wrote 10
'characters', so the buffered writer assumes only 10 of the 20 bytes
went out and writes the last 10 bytes again. This leaves a trail of
garbage text after every write.
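
Here's a rough sketch of that failure, assuming a Windows 7 console
that has been switched to codepage 65001 (the Greek alpha string is
just an arbitrary 2-byte-per-character example):

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    hout = kernel32.GetStdHandle(-11)  # STD_OUTPUT_HANDLE

    kernel32.SetConsoleOutputCP(65001)
    data = ('\u03b1' * 10).encode('utf-8')  # 10 alphas, 20 UTF-8 bytes
    written = wintypes.DWORD(0)
    kernel32.WriteFile(hout, data, len(data),
                       ctypes.byref(written), None)
    # On Windows 7 this should report 10 (decoded UTF-16 characters)
    # instead of 20 (bytes), so a buffered writer would re-send the
    # last 10 bytes as a second, garbled write.
    print(written.value, 'of', len(data), 'bytes reported written')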

When a program reads from the console using ReadFile or ReadConsoleA,
the console's input buffer has to be encoded to the target codepage.
It assumes that an ANSI character is 1 byte, so if you try to read N
bytes, it tries to encode N characters. This fails for non-ASCII
UTF-8, which needs 2 to 4 bytes per character. However, the console
won't decrease the number of characters to make the result fit in the
N-byte buffer. In the API the argument is named "nNumberOfCharsToRead",
and the console sticks to that literally. The result is that 0 bytes
are read, which is
interpreted as EOF. So the REPL will quit, and input() will raise
EOFError.
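
A rough interactive sketch of the read side, again assuming the
console is at codepage 65001; type a line that contains a non-ASCII
character (e.g. "café") when it blocks for input:

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    hin = kernel32.GetStdHandle(-10)  # STD_INPUT_HANDLE

    kernel32.SetConsoleCP(65001)
    buf = (ctypes.c_char * 100)()
    nread = wintypes.DWORD(0)
    ok = kernel32.ReadFile(hin, buf, len(buf),
                           ctypes.byref(nread), None)
    # With non-ASCII input the encoding step fails, 0 bytes are
    # reported read, and callers treat that as EOF.
    print('success:', bool(ok), 'bytes read:', nread.value)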

