[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

Wed Dec 6 05:34:59 EST 2017

Hi Naoki,

2017-12-06 5:07 GMT+01:00 INADA Naoki <songofacandy at gmail.com>:
> Oh, revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.

The Strict UTF-8 Mode is for you if you prioritize correctness over usability.

In the very first version of my PEP/idea, I wanted to use
UTF-8/strict. But then I started to play with the implementation and I
got many "practical" issues. Using UTF-8/strict, you quickly get
encoding errors. For example, you become unable to read undecodable
bytes from stdin. stdin.read() only gives you an error, without
letting you decide how to handle these "invalid" data. Same issue with
stdout.

Compare encodings of the UTF-8 mode and the Strict UTF-8 Mode:
https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler

I tried to summarize all these kinds of issues in the second short
subsection of the rationale:
https://www.python.org/dev/peps/pep-0540/#passthough-undecodable-bytes-surrogateescape

In the old long version of the PEP, I tried to explain UTF-8/strict
issues with very concrete examples, the removed "Use Cases" section:
https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490

Tell me if I should rephrase the rationale of the PEP 540 to better
justify the usage of surrogateescape.

Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
surrogateescape, or backslashreplace for stderr, or surrogatepass for
fsencode/fsencode on Windows, or strict for Strict UTF-8 Mode"... But
the PEP title would be too long, no? :-)

> And opening binary file without "b" option is very common mistake of new
> developers.  If default error handler is surrogateescape, they lose a chance
> to notice their bug.

When open() in used in text mode to read "binary data", usually the
developer would only notify when getting the POSIX locale (ASCII
encoding). But the PEP 538 already changed that by using the C.UTF-8
locale (and so the UTF-8 encoding, instead of the ASCII encoding).

I'm not sure that locales are the best way to detect such class of
bytes. I suggest to use -b or -bb option to detect such bugs without
having to care of the locale.

> On the other hand, it helps some use cases when user want byte-transparent
> behavior, without modifying code to use "surrogateescape" explicitly.
>
> Which is more important scenario?  Anyone has opinion about it?
> Are there any rationals and use cases I missing?

Usually users expect that Python 3 "just works" and don't bother them
with the locale (thay nobody understands).

The old version of the PEP contains a long list of issues:
https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986

I already replaced the strict error handler with surrogateescape for
sys.stdin and sys.stdout on the POSIX locale in Python 3.5:
https://bugs.python.org/issue19977

For the rationale, read for example these comments:

* https://bugs.python.org/issue19846#msg205727 "As I would state it,
the problem is that python's boundary with the OS is not yet uniform.
(...) Note that currently, input() and sys.stdin.read() won't read
undecodable data so this is somewhat symmetrical but it seems to me
that saying "everything that interfaces with the OS except the
standard streams will use surrogateescape on undecodable bytes" is
drawing a line in an unintuitive location."

* https://bugs.python.org/issue19977#msg206141 "My impression was that
python3 was supposed to help get rid of UnicodeError tracebacks, not
mojibake.  If mojibake was the problem then we should never have gone
down the surrogateescape path for input."

* https://bugs.python.org/issue19846#msg205646 "For example I'm using
[LANG=C] for testcases to set the language uncomplicated to english."

In bug reports, to get the user expectations, just ignore all core
developers comments :-)

Users set the locale to C to get messages in english and still expects
"Unicode" to work properly.

Only Python 3 is so strict about encodings. Most other programming
languages, like Python 2, "just works", since they process data as
bytes.

Victor