[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

Wed Dec 6 09:02:16 EST 2017

>> And I have one worrying point.
>> With UTF-8 mode, open()'s default encoding/error handler is
>> UTF-8/surrogateescape.
>
> The Strict UTF-8 Mode is for you if you prioritize correctness over usability.

Yes, but as I said, I care about inexperienced developers
who don't know what UTF-8 mode is.

>
> In the very first version of my PEP/idea, I wanted to use
> UTF-8/strict. But then I started to play with the implementation and I
> got many "practical" issues. Using UTF-8/strict, you quickly get
> encoding errors. For example, you become unable to read undecodable
> bytes from stdin. stdin.read() only gives you an error, without
> letting you decide how to handle these "invalid" data. Same issue with
> stdout.
>

I don't care about stdio, because PEP 538 already uses surrogateescape as the
error handler for stdio:
https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams

I care only about the builtin open()'s behavior.
PEP 538 doesn't change the default error handler of open().

I think PEP 538 and PEP 540 should behave almost identically, except for
whether they change the locale.  So I need a very strong reason if PEP 540
changes the default error handler of open().

> In the old long version of the PEP, I tried to explain UTF-8/strict
> issues with very concrete examples, the removed "Use Cases" section:
> https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490
>
> Tell me if I should rephrase the rationale of the PEP 540 to better
> justify the usage of surrogateescape.

OK, the "List a directory into a text file" example demonstrates why
surrogateescape is used for open().  If os.listdir() returns surrogateescaped
data, file.write() will fail with UTF-8/strict.
All other examples are about stdio.
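
To make the listdir case concrete, here is a minimal sketch. The file name
b"caf\xff.txt" and the output file "listing.txt" are hypothetical, and the
decode() call only imitates what os.listdir() is assumed to return for an
undecodable name on POSIX. The error handlers are passed explicitly so the
result does not depend on the locale or on any mode:

```python
import os
import tempfile

# Hypothetical undecodable file name: byte 0xff is never valid in UTF-8.
raw_name = b"caf\xff.txt"
# Imitate the str that os.listdir() would hand back for such a name.
name = raw_name.decode("utf-8", "surrogateescape")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "listing.txt")

    # UTF-8/strict: the lone surrogate in the name cannot be encoded.
    try:
        with open(path, "w", encoding="utf-8", errors="strict") as f:
            f.write(name)
        strict_failed = False
    except UnicodeEncodeError:
        strict_failed = True

    # UTF-8/surrogateescape: the surrogate is turned back into byte 0xff.
    with open(path, "w", encoding="utf-8", errors="surrogateescape") as f:
        f.write(name)
    with open(path, "rb") as f:
        roundtrip = f.read()

print(strict_failed, roundtrip == raw_name)
```

With strict the write fails; with surrogateescape the original bytes come
back unchanged.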

But we should achieve a good balance between correctness and usability in the
default behavior.

>
> Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
> surrogateescape, or backslashreplace for stderr, or surrogatepass for
> fsencode/fsdecode on Windows, or strict for Strict UTF-8 Mode"... But
> the PEP title would be too long, no? :-)
>

I feel the short name is enough.

>
>> And opening a binary file without the "b" option is a very common mistake
>> of new developers.  If the default error handler is surrogateescape, they
>> lose a chance to notice their bug.
>
> When open() is used in text mode to read "binary data", usually the
> developer would only notice when getting the POSIX locale (ASCII
> encoding). But the PEP 538 already changed that by using the C.UTF-8
> locale (and so the UTF-8 encoding, instead of the ASCII encoding).
>

With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not
UTF-8/surrogateescape.

For example, this code raises UnicodeDecodeError with PEP 538 if the
file is a JPEG file:

    with open(fn) as f:
        f.read()
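
A self-contained sketch of the difference follows. The file name "image.jpg"
and the header bytes are only an illustration of typical JPEG content, and
the encoding and error handler are given explicitly to imitate the two
defaults being compared, rather than relying on the locale or on UTF-8 mode:

```python
import os
import tempfile

# Typical first bytes of a JPEG file; 0xff is never valid in UTF-8.
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

with tempfile.TemporaryDirectory() as tmp:
    fn = os.path.join(tmp, "image.jpg")
    with open(fn, "wb") as f:
        f.write(jpeg_header)

    # UTF-8/strict: the missing "b" is reported immediately.
    try:
        with open(fn, encoding="utf-8", errors="strict") as f:
            f.read()
        strict_raised = False
    except UnicodeDecodeError:
        strict_raised = True

    # UTF-8/surrogateescape: the read succeeds and the bug stays hidden.
    with open(fn, encoding="utf-8", errors="surrogateescape") as f:
        text = f.read()

print(strict_raised, type(text).__name__)
```

In the second case the developer silently gets a str containing lone
surrogates instead of an error.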

> I'm not sure that locales are the best way to detect such a class of
> bugs. I suggest using the -b or -bb option to detect such bugs without
> having to care of the locale.
>

But many new developers don't use or know about the -b or -bb options.

>
>> On the other hand, it helps some use cases where users want byte-transparent
>> behavior, without modifying code to use "surrogateescape" explicitly.
>>
>> Which is the more important scenario?  Does anyone have an opinion about it?
>> Are there any rationales and use cases I'm missing?
>
> Usually users expect that Python 3 "just works" and don't bother them
> with the locale (that nobody understands).
>
> The old version of the PEP contains a long list of issues:
> https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986
>
> I already replaced the strict error handler with surrogateescape for
> sys.stdin and sys.stdout on the POSIX locale in Python 3.5:
> https://bugs.python.org/issue19977
>
> For the rationale, read for example these comments:
>
[snip]

OK, I'll read them and think again about open()'s default behavior.
But I still hope open()'s behavior will be consistent between PEP 538 and
PEP 540.

Regards,