[Python-ideas] PEP 540: Add a new UTF-8 mode

Victor Stinner victor.stinner at gmail.com
Thu Jan 12 10:12:07 EST 2017


2017-01-12 1:23 GMT+01:00 INADA Naoki <songofacandy at gmail.com>:
> I'm ±0 to surrogateescape by default.  I feel +1 for stdout and -1 for stdin.

The use case is to be able to write a Python 3 program which works
with UNIX pipes without failing with encoding errors:
https://www.python.org/dev/peps/pep-0540/#producer-consumer-model-using-pipes
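
To illustrate the pipe use case, here is a minimal sketch of a filter that decodes its input and re-encodes it unchanged. The sample bytes and the `passthrough` helper are made up for illustration and are not part of the PEP; the sketch only assumes that both ends of the pipe use UTF-8 with the same error handler.

```python
# Hypothetical input: bytes that are not valid UTF-8 (e.g. Latin-1 data or binary junk).
raw = b"caf\xe9 \xff\n"


def passthrough(data: bytes, errors: str) -> bytes:
    """Decode then re-encode, as a text-mode filter between two pipes would."""
    text = data.decode("utf-8", errors=errors)
    return text.encode("utf-8", errors=errors)


# surrogateescape maps each undecodable byte to a lone surrogate (U+DC80..U+DCFF)
# and maps it back to the same byte on encoding, so the data survives unchanged.
roundtrip = passthrough(raw, "surrogateescape")

# strict refuses the same input with an exception.
try:
    passthrough(raw, "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With surrogateescape the consumer gets exactly the bytes the producer wrote; with strict the filter stops at the first invalid byte.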

If you want something stricter, there is the UTF-8 Strict mode, which
prevents mojibake everywhere. I'm not sure that the UTF-8 Strict mode
is really useful. When I implemented it, I quickly understood that
using strict *everywhere* is just a dead end: it would fail in too many
places.
https://www.python.org/dev/peps/pep-0540/#use-the-strict-error-handler-for-operating-system-data

I'm not even sure yet that a Python 3 with stdin using strict is "usable".


> In output case, surrogateescape is weaker than strict, but it only allows
> surrogateescaped binary.  If program carefully use surrogateescaped decode,
> surrogateescape on stdout is safe enough.

What do you mean by "carefully use surrogateescaped decode"?

The rationale for using surrogateescape on stdout is to support this use case:
https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-stdout
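
A rough sketch of that use case, using an in-memory stream in place of the real stdout. The file name is hypothetical, and `write_name` is a helper invented for this example; the name is decoded explicitly here to mimic what os.listdir() returns on a POSIX system with a UTF-8 locale.

```python
import io

# Hypothetical file name containing a byte that is not valid UTF-8, decoded
# the way os.listdir() would return it on POSIX with a UTF-8 locale.
name = b"report-\xff.txt".decode("utf-8", "surrogateescape")


def write_name(errors: str) -> bytes:
    """Write the name to a text stream configured like stdout; return the bytes."""
    buf = io.BytesIO()
    out = io.TextIOWrapper(buf, encoding="utf-8", errors=errors, newline="\n")
    out.write(name + "\n")
    out.flush()
    return buf.getvalue()


# With surrogateescape, the original byte is written back unchanged.
escaped_output = write_name("surrogateescape")

# With strict, the lone surrogate in the name cannot be encoded.
try:
    write_name("strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

This is the case where a simple "print every file name" loop fails with strict on stdout but works with surrogateescape.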


> On the other hand, surrogateescape is very weak for input.  It accepts
> arbitrary bytes.
> It should be used carefully.

In my experience with the Python bug tracker, almost nobody
understands Unicode and locales. For the "Producer-consumer model
using pipes" use case, the encoding issues of Python 3.6 can be a
blocker. Some developers may prefer a different programming language
which doesn't bother them with Unicode: basically, *all* other
programming languages, no?


> But I agree different encoding handler between stdin/stdout is not beautiful.
> That's why I'm ±0.

That's why there are two modes: UTF-8 and UTF-8 Strict. But I'm not
100% sure yet which encodings and error handlers should be used
;-) I started to play with my PEP 540 implementation. I already had to
update PEP 540 and its implementation for Windows. On Windows,
os.fsdecode/fsencode now use surrogatepass, not surrogateescape
(Python 3.5 uses strict on Windows).
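
For readers who mix up the two handlers, a small sketch of the difference. It calls the codecs directly rather than os.fsdecode/fsencode, so it behaves the same on every platform; the sample code point is arbitrary.

```python
# A lone surrogate code point, as can appear in a Windows (UTF-16) file name.
lone = "\udc80"

# surrogatepass encodes the surrogate itself, as a three-byte UTF-8-style sequence.
passed = lone.encode("utf-8", "surrogatepass")

# surrogateescape treats U+DC80 as the escaped form of the raw byte 0x80.
escaped = lone.encode("utf-8", "surrogateescape")

# Each handler round-trips its own output back to the same string.
passed_back = passed.decode("utf-8", "surrogatepass")
escaped_back = escaped.decode("utf-8", "surrogateescape")

# surrogatepass does not accept arbitrary bytes: a bare 0x80 is still an error.
try:
    b"\x80".decode("utf-8", "surrogatepass")
    raw_byte_rejected = False
except UnicodeDecodeError:
    raw_byte_rejected = True
```

So surrogatepass preserves lone surrogates coming from UTF-16 names, while surrogateescape preserves undecodable bytes coming from byte-oriented file systems.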

Victor

