[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

Thu Dec 7 17:57:48 EST 2017

While I'm not strongly convinced that the open() error handler must be
changed to surrogateescape, first I would like to make sure that it's
really a very bad idea before changing it :-)

2017-12-07 7:49 GMT+01:00 INADA Naoki <songofacandy at gmail.com>:
> I just came up with crazy idea; changing default error handler of open()
> to "surrogateescape" only when open mode is "w" or "a".

The idea is tempting but I'm not sure that it's a good idea. Moreover,
what about "r+" and "w+" modes?

I dislike getting a different behaviour for inputs and outputs. The
motivation for surrogateescape is to "pass through" undecodable bytes:
you need to handle them on the input side and on the output side.
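
A minimal sketch of this "pass through" behaviour, using only the
built-in bytes.decode() and str.encode() methods. The sample bytes are
made up for illustration; they are not taken from the PEP:

```python
# Bytes that are mostly UTF-8, with one stray Latin-1 byte (0xE9).
raw = b"caf\xc3\xa9 and caf\xe9\n"

# Input side: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Output side: the same handler turns U+DCE9 back into the byte 0xE9.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# With the strict handler on the output side, the surrogate cannot be
# encoded, so the data cannot leave the program unchanged.
try:
    text.encode("utf-8", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The round trip only works when both sides use surrogateescape, which
is the reason to treat inputs and outputs the same way.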

That's why I decided to not only change sys.stdin error handler to
surrogateescape for the POSIX locale, but also sys.stdout:
https://bugs.python.org/issue19977

> When reading, "surrogateescape" error handler is dangerous because
> it can produce arbitrary broken unicode string by mistake.

I'm fine with that. I wouldn't say that it's the purpose of the PEP,
but sadly it's an expected, known and documented side effect.

You get the same behaviour with Unix command line tools and most
Python 2 applications (processing data as bytes). Nothing new under
the sun.

PEP 540 allows users to write applications behaving like Unix
tools/Python 2 with the power of the Python 3 language and stdlib.

Again, use the Strict UTF-8 mode if you prioritize *correctness* over
*usability*.

Honestly, I'm not even sure that the Strict UTF-8 mode is *usable* in
practice, since we are all surrounded by old documents encoded to
various "legacy" encodings (where legacy means: "not UTF-8", like
Latin1 or ShiftJIS). The first non-ASCII character which is not
encoded to UTF-8 is going to "crash" the application (big traceback
with a Unicode error).

Maybe the problem is the feature name: "UTF-8 mode". Users may think
of "strict" when they read "UTF-8", since UTF-8 is known to be a
strict encoding. For example, UTF-8 is much stricter than latin1, which
is unable to tell if a document was encoded to latin1 or something else.
UTF-8 is able to tell if a document was actually encoded to UTF-8 or
not, thanks to the design of the encoding itself.
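
To illustrate the difference, here is a small sketch. The helper name
is_valid_utf8() is hypothetical, chosen only for this example:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 with the strict handler."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


latin1_bytes = "caf\u00e9".encode("latin1")  # b'caf\xe9'
utf8_bytes = "caf\u00e9".encode("utf-8")     # b'caf\xc3\xa9'

# latin1 maps every byte 0x00-0xFF to a code point, so decoding never
# fails and cannot reveal that the data was really UTF-8: the result
# is silently wrong text (mojibake).
mojibake = utf8_bytes.decode("latin1")
```

Here is_valid_utf8() rejects the latin1 bytes, while decoding the UTF-8
bytes as latin1 succeeds without any error.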

> And it doesn't allow following code:
>
>     with open("image.jpg", "r") as f:  # Binary data, not UTF-8
>         return f.read()

Using a JPEG image, the example is obviously wrong.

But using surrogateescape on open() is meant for reading *text files*
which are mostly correctly encoded to UTF-8, except for a few bytes.

I'm not sure how to explain the issue. The Mercurial wiki page has a
good example of this issue that they call the "Makefile problem":
https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22

While it's not exactly the discussed issue, it gives you an idea of
the kind of issues that you have when you use open(filename,
encoding="utf-8", errors="strict") versus open(filename,
encoding="utf-8", errors="surrogateescape")
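
A rough sketch of the two behaviours, assuming a Makefile-like file
that is UTF-8 except for one Latin-1 byte. The file content and the
edit are invented for illustration:

```python
import os
import tempfile

# Hypothetical content: UTF-8 except for one Latin-1 byte (0xE9).
original = b"# auteur: Ren\xe9\nall:\n\techo hello\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Makefile")
    with open(path, "wb") as f:
        f.write(original)

    # errors="strict": reading fails on the first undecodable byte.
    try:
        with open(path, encoding="utf-8", errors="strict") as f:
            f.read()
        strict_ok = True
    except UnicodeDecodeError:
        strict_ok = False

    # errors="surrogateescape": read, edit, write back. The unknown
    # byte is carried through unchanged. newline="" disables newline
    # translation so the bytes are compared as-is on every platform.
    with open(path, encoding="utf-8", errors="surrogateescape",
              newline="") as f:
        text = f.read()
    text = text.replace("hello", "world")
    with open(path, "w", encoding="utf-8", errors="surrogateescape",
              newline="") as f:
        f.write(text)

    with open(path, "rb") as f:
        result = f.read()
```

With the strict handler the file cannot be read at all; with
surrogateescape the edit is applied and the stray byte is preserved.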

> I'm not sure about this is good idea.  And I don't know when is good for
> changing write error handler; only when PEP 538 or PEP 540 is used?
> Or always when os.fsencoding() is UTF-8?
>
> Any thoughts?

PEP 538 doesn't affect the error handler. PEP 540 only changes
the error handler for the POSIX locale; it's a deliberate choice.
PEP 538 is only enabled for the POSIX locale, and PEP 540 will
also be enabled by default for this locale.

I dislike the idea of changing the error handler if the filesystem
encoding is UTF-8. The UTF-8 mode must be enabled explicitly, on
purpose. This reduces any risk of regression, and prepares users who
enable it for any potential issue.

Victor