[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
Victor Stinner
victor.stinner at gmail.com
Thu Dec 7 19:48:25 EST 2017
2017-12-08 0:26 GMT+01:00 Guido van Rossum <guido at python.org>:
> You will quickly get decoding errors, and that is INADA's point. (Unless you
> use encoding='Latin-1'.) His worry is that the surrogateescape error handler
> makes it so that you won't get decoding errors, and then the failure mode is
> much harder to debug.
Hum, my question was more about whether Python fails because an
operation receives strings where bytes were expected, or whether
Python fails with a decoding error... But now I'm not sure anymore
that this level of detail really matters.
Let me think out loud. To explain Unicode issues, I like to use
filenames, since they are something users commonly see, handle
directly and can modify (and so they enter many non-ASCII characters
like diacritics and emojis ;-)).
Filenames can be found on the command line, in environment variables
(PYTHONSTARTUP), on stdin (read a list of files from stdin), on stdout
(write the list of files to stdout), but also in text files (the
Mercurial "makefile problem").
I consider that the command line and environment variables should
"just work" and so use surrogateescape. It would be too annoying not
even to be able to *start* Python because of a Unicode error. For
example, it wouldn't be easy to identify which environment variable
causes the issue. Fortunately, the UTF-8 mode doesn't change anything
here: surrogateescape has already been used since Python 3.3 for the
command line and environment variables.
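For illustration, here is a minimal sketch of how those inputs already
behave on POSIX: sys.argv and os.environ are decoded with
surrogateescape, and os.fsencode() recovers the original bytes (the
PYTHONSTARTUP lookup is just an example).

    import os
    import sys

    # On POSIX, Python 3 decodes sys.argv and os.environ with the
    # surrogateescape error handler, so undecodable bytes survive as
    # lone surrogates instead of raising an error at startup.
    for arg in sys.argv[1:]:
        # os.fsencode() round-trips back to the original bytes.
        print(repr(arg), "->", os.fsencode(arg))

    # The same applies to environment variables such as PYTHONSTARTUP:
    startup = os.environ.get("PYTHONSTARTUP")
    if startup is not None:
        print(repr(startup), "->", os.fsencode(startup))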
For stdin/stdout, I think that the main motivation here is writing
Unix command line tools in Python 3: pass through undecodable bytes
without bugging the user with Unicode. Users don't use stdin and
stdout as regular files; they are used more as pipes to pass data
between programs, as in a shell pipeline like "producer | consumer".
Sometimes stdout is redirected to a file, but I consider that it is
still expected to behave like a pipe or a regular TTY stdout.
IMHO we are still in the safe surrogateescape area (for the specific
case of the UTF-8 mode).
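As a sketch of that use case, here is a small pass-through filter. The
explicit TextIOWrapper setup stands in for what the UTF-8 mode would
do by default for stdin/stdout; the "seen: " prefix is just an
arbitrary transformation for the example.

    import io
    import sys

    # Wrap the binary stdin/stdout buffers with surrogateescape so that
    # undecodable bytes pass through the pipeline unchanged instead of
    # raising UnicodeDecodeError in the middle of a pipe.
    stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8",
                             errors="surrogateescape")
    stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8",
                              errors="surrogateescape",
                              line_buffering=True)

    for line in stdin:
        stdout.write("seen: " + line)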
Ok, now comes the real question: open().
For open(), I used the example of a code snippet *writing* the content
of a directory (os.listdir) into a text file. Another example is
reading filenames from a text file while passing through undecodable
bytes thanks to surrogateescape.
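A sketch of both directions (the file name "files.txt" is made up for
the example): writing the os.listdir() output with surrogateescape
round-trips undecodable filenames, and reading them back with the same
handler plus os.fsencode() restores the original bytes.

    import os

    # Writing the directory listing: undecodable filenames (decoded by
    # os.listdir() with surrogateescape on POSIX) are written back as
    # the very bytes they came from.
    with open("files.txt", "w", encoding="utf-8",
              errors="surrogateescape") as f:
        for name in os.listdir("."):
            f.write(name + "\n")

    # Reading the list back and re-encoding to bytes for os functions:
    with open("files.txt", "r", encoding="utf-8",
              errors="surrogateescape") as f:
        names = [os.fsencode(line.rstrip("\n")) for line in f]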
But Naoki explained that open() is commonly misused to open binary
files, and that Python should fail loudly to notify developers of
their mistake.
If I have to choose between the two categories of usage of open(),
"read undecodable bytes in UTF-8 from a text file" versus "misuse
open() on a binary file", I expect that the latter is more common, and
so open() shouldn't use surrogateescape by default.
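To make the two failure modes concrete, here is a contrived sketch
(file name and content are invented): with the strict handler the
misuse surfaces immediately, while with surrogateescape it silently
produces mojibake that only breaks later.

    # Create a fake binary file for the demonstration.
    with open("photo.jpg", "wb") as f:
        f.write(bytes(range(256)))

    # With the default strict error handler, the misuse is caught
    # immediately:
    try:
        open("photo.jpg", encoding="utf-8").read()
    except UnicodeDecodeError as exc:
        print("caught early:", exc)

    # With surrogateescape the read "succeeds" and the bug only shows
    # up later, once the mojibake string reaches code expecting text:
    text = open("photo.jpg", encoding="utf-8",
                errors="surrogateescape").read()
    print(len(text), "characters read, no error raised")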
While stdin and stdout are usually associated with Unix pipes and Unix
tools working on bytes, files are more commonly associated with
important data that must not be lost or corrupted. Python is expected
to "help" the developer use the proper options to read content from
a file and to write content into a file. So I understand that open()
should use the "strict" error handler in the UTF-8 mode, rather than
"surrogateescape".
I can live with this "tiny" change to my PEP. I just posted a third
version of my PEP in which the open() error handler remains strict (it
is no longer changed by the PEP).
Victor