[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

Steve Dower steve.dower at python.org
Thu Sep 1 18:28:53 EDT 2016


I'm about to be offline for a few days, so I wanted to get my current 
draft PEPs out for people can read and review.

I don't believe there is a lot of change as a result of either PEP, but 
the impact of what change there is needs to be weighed against the benefits.

If anything, I'm likely to have underplayed the impact of this change 
(though I've had a *lot* of support for this one). Just stating my 
biases up-front - take it as you wish.

See https://bugs.python.org/issue1602 for the current proposed patch for 
this PEP. I will likely update it after my upcoming flights, but it's in 
pretty good shape right now.

Cheers,
Steve


---
https://github.com/python/peps/blob/master/pep-0528.txt
---

PEP: 528
Title: Change Windows console encoding to UTF-8
Version: $Revision$
Last-Modified: $Date$
Author: Steve Dower <steve.dower at python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Aug-2016
Post-History: 01-Sep-2016

Abstract
========

Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have 
been long
discouraged in favor of the UTF-16 APIs. Within the operating system, 
all text
is represented as UTF-16, and the ANSI APIs perform encoding and 
decoding using
the active code page.

This PEP proposes changing the default standard stream implementation on 
Windows
to use the Unicode APIs. This will allow users to print and input the 
full range
of Unicode characters at the default Windows console. This also requires a
subtle change to how the tokenizer parses text from readline hooks, that 
should
have no backwards compatibility issues.

Specific Changes
================

Add _io.WindowsConsoleIO
------------------------

Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors
representing standard input, output and error. We add a new class 
(implemented
in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the 
Windows
console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``.

This class will be used when the legacy-mode flag is not in effect, when 
opening
a standard stream by file descriptor and the stream is a console buffer 
rather
than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is 
today.

This is a raw (bytes) IO class that requires text to be passed encoded with
utf-8, which will be decoded to utf-16-le and passed to the Windows APIs.
Similarly, bytes read from the class will be provided by the operating 
system as
utf-16-le and converted into utf-8 when returned to Python.

The use of an ASCII compatible encoding is required to maintain 
compatibility
with code that bypasses the ``TextIOWrapper`` and directly writes ASCII 
bytes to
the standard streams (for example, [process_stdinreader.py]_). Code that 
assumes
a particular encoding for the standard streams other than ASCII will likely
break.

Add _PyOS_WindowsConsoleReadline
--------------------------------

To allow Unicode entry at the interactive prompt, a new readline hook is
required. The existing ``PyOS_StdioReadline`` function will delegate to 
the new
``_PyOS_WindowsConsoleReadline`` function when reading from a file 
descriptor
that is a console buffer and the legacy-mode flag is not in effect (the 
logic
should be identical to above).

Since the readline interface is required to return an 8-bit encoded 
string with
no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function 
transcodes from
utf-16-le as read from the operating system into utf-8.

The function ``PyRun_InteractiveOneObject`` which currently obtains the 
encoding
from ``sys.stdin`` will select utf-8 unless the legacy-mode flag is in 
effect.
This may require readline hooks to change their encodings to utf-8, or to
require legacy-mode for correct behaviour.

Add legacy mode
---------------

Launching Python with the environment variable 
``PYTHONLEGACYWINDOWSSTDIO`` set
will enable the legacy-mode flag, which completely restores the previous
behaviour.

Alternative Approaches
======================

The ``win_unicode_console`` package [win_unicode_console]_ is a pure-Python
alternative to changing the default behaviour of the console.

Code that may break
===================

The following code patterns may break or see different behaviour as a 
result of
this change. All of these code samples require explicitly choosing to 
use a raw
file object in place of a more convenient wrapper that would prevent any 
visible
change.

Assuming stdin/stdout encoding
------------------------------

Code that assumes that the encoding required by ``sys.stdin.buffer`` or
``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may 
currently be
working by chance, but could encounter issues under this change. For 
example::

     sys.stdout.buffer.write(text.encode('mbcs'))
     r = sys.stdin.buffer.read(16).decode('cp437')

To correct this code, the encoding specified on the ``TextIOWrapper`` 
should be
used, either implicitly or explicitly::

     # Fix 1: Use wrapper correctly
     sys.stdout.write(text)
     r = sys.stdin.read(16)

     # Fix 2: Use encoding explicitly
     sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
     r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)

Incorrectly using the raw object
--------------------------------

Code that uses the raw IO object and does not correctly handle partial 
reads and
writes may be affected. This is particularly important for reads, where the
number of characters read will never exceed one-fourth of the number of 
bytes
allowed, as there is no feasible way to prevent input from encoding as much
longer utf-8 strings::

     >>> stdin = open(sys.stdin.fileno(), 'rb')
     >>> data = stdin.raw.read(15)
     abcdefghijklm
     b'abc'
     # data contains at most 3 characters, and never more than 12 bytes
     # error, as "defghijklm\r\n" is passed to the interactive prompt

To correct this code, the buffered reader/writer should be used, or the 
caller
should continue reading until its buffer is full.::

     # Fix 1: Use the buffered reader/writer
     >>> stdin = open(sys.stdin.fileno(), 'rb')
     >>> data = stdin.read(15)
     abcedfghijklm
     b'abcdefghijklm\r\n'

     # Fix 2: Loop until enough bytes have been read
     >>> stdin = open(sys.stdin.fileno(), 'rb')
     >>> b = b''
     >>> while len(b) < 15:
     ... b += stdin.raw.read(15)
     abcedfghijklm
     b'abcdefghijklm\r\n'

Copyright
=========

This document has been placed in the public domain.

References
==========

.. [process_stdinreader.py] Twisted's process_stdinreader.py
 
(https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py)
.. [win_unicode_console] win_unicode_console package
    (https://pypi.org/project/win_unicode_console/)


More information about the Python-Dev mailing list