[Python-ideas] PEP 540: Add a new UTF-8 mode

Thu Jan 5 11:50:37 EST 2017

> https://www.python.org/dev/peps/pep-0540/

I read PEP 538, PEP 540, and the issues related to switching to UTF-8. At
least, I can say one thing: people have different points of view :-)

To understand why people disagree, I tried to categorize the different points
of view and expectations of Python:

"UNIX mode":

   Python 2 developers and long-time UNIX users expect that their code
   "just works". They like Python 3 features, but Python 3 annoys them
   with various encoding errors. The expectation is to be able to read
   data encoded in various incompatible encodings and write it into
   stdout or a text file. In short, mojibake is not a bug but a feature!

"Strict Unicode mode" for real Unicode fans:

   Python 3 is strict and it's a good thing! Strict codecs help to
   detect bugs in the code very early. These developers understand
   Unicode very well and are able to fix complex encoding issues.
   Mojibake is a no-no for them.

Python 3.6 is not exactly in the first or the latter category: "it
depends".

To read data from the operating system, Python 3.6 behaves in "UNIX
mode": os.listdir() *does* return undecodable filenames, using a funny
encoding that escapes the invalid bytes as lone surrogates (the
surrogateescape error handler).
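
To make this concrete, here is a minimal sketch of what that escaping
does. It calls bytes.decode() directly with UTF-8 rather than
os.listdir(), so the result does not depend on the locale; the filename
is a made-up example.

```python
# Hypothetical filename stored as Latin-1 bytes: 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))  # 'caf\udce9.txt'

# The escaping is reversible: encoding with the same handler restores the bytes.
roundtrip = name.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)  # True
```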

To write data back to the operating system, Python 3.6 wears its
"Unicode nazi" hat and becomes strict. It's no longer possible to write
data from the operating system back to the operating system.
Writing a filename read from os.listdir() into stdout or into a text
file fails with an encode error.
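
A small sketch of that failure, assuming a text stream opened with
UTF-8 and the strict error handler. An in-memory io.BytesIO stands in
for the real file, and the filename is again made up.

```python
import io

# A name as os.listdir() could return it: the byte 0xE9 escaped as U+DCE9.
name = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")

# A strict UTF-8 text stream, standing in for a file opened with open().
stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", errors="strict")

error = None
try:
    stream.write(name)
    stream.flush()
except UnicodeEncodeError as exc:
    # Lone surrogates cannot be encoded to UTF-8 by the strict handler.
    error = exc

print(type(error).__name__)  # UnicodeEncodeError
```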

Subtle behaviour: since Python 3.6, with the POSIX locale, Python uses
the "UNIX mode", but only to write into stdout. It's possible to write
a filename into stdout, but not into a text file.
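
The stdout case can be imitated with a stream that uses the
surrogateescape error handler. This is only a sketch: the io.BytesIO
buffer is a stand-in for the real stdout, and the filename is made up.

```python
import io

raw = b"caf\xe9.txt"  # hypothetical undecodable filename
name = raw.decode("utf-8", errors="surrogateescape")

# Stand-in for stdout configured with the surrogateescape error handler.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="utf-8", errors="surrogateescape")
out.write(name)
out.flush()

# The original bytes come out unchanged: no error, the data passes through.
written = buf.getvalue()
print(written == raw)  # True
```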

In its current shape, my PEP 540 leaves Python's default unchanged, but
adds two modes: UTF-8 and UTF-8 strict. The UTF-8 mode is more or less
the UNIX mode generalized for all inputs and outputs: mojibake is a
feature, just pass bytes unchanged. The UTF-8 strict mode is more
extreme than the current "Strict Unicode mode" since it fails on
*decoding* data from the operating system.
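
The difference on the decoding side can be sketched with the codec
error handlers alone. This only illustrates strict versus
surrogateescape decoding of a made-up filename; it is not the PEP's
implementation.

```python
raw = b"caf\xe9.txt"  # hypothetical filename that is not valid UTF-8

# Strict decoding rejects the invalid byte right away.
try:
    raw.decode("utf-8", errors="strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape decoding accepts it and keeps the byte recoverable.
lenient = raw.decode("utf-8", errors="surrogateescape")

print(strict_ok)       # False
print(ascii(lenient))  # 'caf\udce9.txt'
```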

Now that I have a better view of what we have and what we want, the
question is whether the default behaviour should be changed and, if so,
how.

Nick's PEP 538 doesn't exactly move to the "UNIX mode" (open() doesn't
use surrogateescape) nor to the "Strict Unicode mode" (fsdecode() still
uses surrogateescape): it's still in a grey area. Maybe Nick can
elaborate on the use case or update his PEP?

I guess that all users and most developers are more in the "UNIX mode"
camp. *If* we want to change the default, I suggest using the "UNIX
mode" by default.

The question is whether anyone relies on or likes the current Python
3.6 behaviour: reading "just works", writing is strict.

If you like this behaviour, what do you think of the tiny Python 3.6
change: using surrogateescape for stdout when the locale is POSIX?

Victor