<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Wed, 6 Dec 2017 at 06:10 INADA Naoki <<a href="mailto:songofacandy@gmail.com">songofacandy@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">>> And I have one worrying point.<br>

>> With UTF-8 mode, open()'s default encoding/error handler is<br>

>> UTF-8/surrogateescape.<br>

><br>

> The Strict UTF-8 Mode is for you if you prioritize correctness over usability.<br>

<br>

Yes, but as I said, I care about inexperienced developers<br>

who don't know what UTF-8 mode is.<br>

<br>

><br>

> In the very first version of my PEP/idea, I wanted to use<br>

> UTF-8/strict. But then I started to play with the implementation and I<br>

> got many "practical" issues. Using UTF-8/strict, you quickly get<br>

> encoding errors. For example, you become unable to read undecodable<br>

> bytes from stdin. stdin.read() only gives you an error, without<br>

> letting you decide how to handle these "invalid" data. Same issue with<br>

> stdout.<br>

><br>

<br>

I don't care about stdio, because PEP 538 uses surrogateescape for stdio/error<br>

<a href="https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams" rel="noreferrer" target="_blank">https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams</a><br>

<br>

I care only about the builtin open()'s behavior.<br>

PEP 538 doesn't change the default error handler of open().<br>

<br>

I think PEP 538 and PEP 540 should behave almost identically, except for<br>

whether the locale is changed<br>

or not.  So I need a very strong reason if PEP 540 changes the default error<br>

handler of open().<br></blockquote><div><br></div><div>I don't have enough locale experience to weigh in as an expert, but I already was leaning towards INADA-san's logic of not wanting to change open() and this makes me really not want to change it.</div><div><br></div><div>-Brett<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

> In the old long version of the PEP, I tried to explain UTF-8/strict<br>

> issues with very concrete examples, the removed "Use Cases" section:<br>

> <a href="https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490" rel="noreferrer" target="_blank">https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490</a><br>

><br>

> Tell me if I should rephrase the rationale of the PEP 540 to better<br>

> justify the usage of surrogateescape.<br>

<br>

OK, the "List a directory into a text file" example demonstrates why surrogateescape<br>

is used for open().  If os.listdir() returns surrogateescaped data,<br>

file.write() will fail<br>

under the strict error handler.<br>

All other examples are about stdio.<br>
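
<br>

To make that case concrete, here is a minimal sketch. The file name and the write_listing() helper are made up for illustration and are not taken from the PEP. It decodes an undecodable name the way os.listdir() does on POSIX, then writes it to a text file with each error handler.<br>

```python
import os
import tempfile

# A file name containing a byte (0xff) that is never valid UTF-8.
raw_name = b"caf\xff.txt"

# On POSIX, os.listdir() returns such names decoded with surrogateescape:
# the undecodable byte becomes the lone surrogate U+DCFF.
name = raw_name.decode("utf-8", "surrogateescape")


def write_listing(path, names, errors):
    """Hypothetical helper: write one name per line using the given error handler."""
    with open(path, "w", encoding="utf-8", errors=errors, newline="\n") as f:
        for n in names:
            f.write(n + "\n")


with tempfile.TemporaryDirectory() as tmp:
    listing = os.path.join(tmp, "listing.txt")

    # UTF-8/strict cannot encode the lone surrogate.
    try:
        write_listing(listing, [name], errors="strict")
        strict_ok = True
    except UnicodeEncodeError:
        strict_ok = False

    # UTF-8/surrogateescape writes the original byte back unchanged.
    write_listing(listing, [name], errors="surrogateescape")
    with open(listing, "rb") as f:
        written = f.read()
```

In this sketch the strict write fails, while the surrogateescape write reproduces the original bytes of the file name.<br>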

<br>

But we should achieve a good balance between correctness and usability in the<br>

default behavior.<br>

<br>

><br>

> Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with<br>

> surrogateescape, or backslashreplace for stderr, or surrogatepass for<br>

> fsencode/fsdecode on Windows, or strict for Strict UTF-8 Mode"... But<br>

> the PEP title would be too long, no? :-)<br>

><br>

<br>

I feel the short name is enough.<br>

<br>

><br>

>> And opening a binary file without the "b" option is a very common mistake of new<br>

>> developers.  If the default error handler is surrogateescape, they lose the chance<br>

>> to notice their bug.<br>

><br>

> When open() is used in text mode to read "binary data", usually the<br>

> developer would only notice when getting the POSIX locale (ASCII<br>

> encoding). But the PEP 538 already changed that by using the C.UTF-8<br>

> locale (and so the UTF-8 encoding, instead of the ASCII encoding).<br>

><br>

<br>

With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not<br>

UTF-8/surrogateescape.<br>

<br>

For example, this code raises UnicodeDecodeError with PEP 538 if the<br>

file is a JPEG file.<br>

<br>

    with open(fn) as f:<br>

        f.read()<br>

<br>
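
The difference between the two error handlers can be sketched as follows. This assumes a few JPEG-like header bytes stand in for a real image, and the temporary file name is illustrative only.<br>

```python
import os
import tempfile

# Typical first bytes of a JPEG file (SOI marker + APP0 segment).
# The byte 0xff is never valid in UTF-8.
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

with tempfile.TemporaryDirectory() as tmp:
    fn = os.path.join(tmp, "image.jpg")
    with open(fn, "wb") as f:
        f.write(jpeg_header)

    # UTF-8/strict: the missing "b" is reported as soon as the file is read.
    try:
        with open(fn, encoding="utf-8", errors="strict") as f:
            f.read()
        strict_raised = False
    except UnicodeDecodeError:
        strict_raised = True

    # UTF-8/surrogateescape: the read succeeds, so the bug goes unnoticed.
    # The result is a str in which each invalid byte is a lone surrogate.
    with open(fn, encoding="utf-8", errors="surrogateescape", newline="") as f:
        text = f.read()
```

Here the strict read raises, while the surrogateescape read quietly returns a str that only round-trips if it is encoded again with surrogateescape.<br>

<br>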

<br>

> I'm not sure that locales are the best way to detect such class of<br>

> bugs. I suggest to use -b or -bb option to detect such bugs without<br>


> having to care of the locale.<br>

><br>

<br>

But many new developers don't use/know the -b or -bb option.<br>

<br>

><br>

>> On the other hand, it helps some use cases when users want byte-transparent<br>

>> behavior, without modifying code to use "surrogateescape" explicitly.<br>

>><br>

>> Which is the more important scenario?  Does anyone have an opinion about it?<br>

>> Are there any rationales and use cases I'm missing?<br>

><br>

> Usually users expect that Python 3 "just works" and doesn't bother them<br>

> with the locale (that nobody understands).<br>

><br>

> The old version of the PEP contains a long list of issues:<br>

> <a href="https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986" rel="noreferrer" target="_blank">https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986</a><br>

><br>

> I already replaced the strict error handler with surrogateescape for<br>

> sys.stdin and sys.stdout on the POSIX locale in Python 3.5:<br>

> <a href="https://bugs.python.org/issue19977" rel="noreferrer" target="_blank">https://bugs.python.org/issue19977</a><br>

><br>

> For the rationale, read for example these comments:<br>

><br>

[snip]<br>

<br>

OK, I'll read them and think again about open()'s default behavior.<br>

But I still hope open()'s behavior is consistent between PEP 538 and PEP 540.<br>

<br>

Regards,<br>


</blockquote></div></div>