[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

Brett Cannon brett at python.org
Wed Dec 6 13:23:46 EST 2017


On Wed, 6 Dec 2017 at 06:10 INADA Naoki <songofacandy at gmail.com> wrote:

> >> And I have one worrying point.
> >> With UTF-8 mode, open()'s default encoding/error handler is
> >> UTF-8/surrogateescape.
> >
> > The Strict UTF-8 Mode is for you if you prioritize correctness over
> > usability.
>
> Yes, but as I said, I care about inexperienced developers
> who don't know what UTF-8 mode is.
>
> >
> > In the very first version of my PEP/idea, I wanted to use
> > UTF-8/strict. But then I started to play with the implementation and I
> > got many "practical" issues. Using UTF-8/strict, you quickly get
> > encoding errors. For example, you become unable to read undecodable
> > bytes from stdin. stdin.read() only gives you an error, without
> > letting you decide how to handle these "invalid" data. Same issue with
> > stdout.
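The strict-versus-surrogateescape difference described above can be sketched as follows. This is only an illustration, not code from the PEP: the in-memory stream stands in for stdin, and the byte string and the helper name read_text are made up for the example.

```python
import io

# Simulated stdin contents (hypothetical): valid UTF-8 text plus one byte,
# 0xff, that can never appear in well-formed UTF-8.
raw = b"caf\xc3\xa9 \xff\n"


def read_text(data, errors):
    """Decode a byte stream as UTF-8 text using the given error handler."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", errors=errors)
    return stream.read()


# With the strict handler, read() fails and the caller gets no data at all.
try:
    read_text(raw, "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With surrogateescape, the bad byte becomes a lone surrogate (U+DCFF) and
# the original bytes can be recovered by encoding with the same handler.
text = read_text(raw, "surrogateescape")
roundtrip = text.encode("utf-8", "surrogateescape")
```

With surrogateescape the caller at least receives the data and can decide what to do with the escaped byte, which is the "practical" argument made above.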
> >
>
> I don't care about stdio, because PEP 538 already uses surrogateescape as
> the error handler for stdio:
>
> https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams
>
> I care only about builtin open()'s behavior.
> PEP 538 doesn't change the default error handler of open().
>
> I think PEP 538 and PEP 540 should behave almost identically, except for
> whether they change the locale.  So I need a very strong reason if PEP 540
> changes the default error handler of open().
>

I don't have enough locale experience to weigh in as an expert, but I
already was leaning towards INADA-san's logic of not wanting to change
open() and this makes me really not want to change it.

-Brett


>
>
> > In the old long version of the PEP, I tried to explain UTF-8/strict
> > issues with very concrete examples, the removed "Use Cases" section:
> >
> > https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490
> >
> > Tell me if I should rephrase the rationale of the PEP 540 to better
> > justify the usage of surrogateescape.
>
> OK, the "List a directory into a text file" example demonstrates why
> surrogateescape is used for open().  If os.listdir() returns
> surrogateescaped data, file.write() will fail.
> All other examples are about stdio.
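A minimal sketch of that "list a directory into a text file" failure, under stated assumptions: the file name is hypothetical and is built by hand (rather than by calling os.listdir()) so the example does not depend on the local filesystem, and an in-memory buffer stands in for the text file opened with open().

```python
import io

# Hypothetical file name as os.listdir() could return it on a POSIX system:
# the undecodable byte 0xff has been escaped to the lone surrogate U+DCFF.
name = b"report-\xff.txt".decode("utf-8", "surrogateescape")


def write_name(errors):
    """Write the name to a UTF-8 text stream and return the bytes produced."""
    buffer = io.BytesIO()
    out = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors, newline="\n")
    out.write(name + "\n")
    out.flush()
    return buffer.getvalue()


# UTF-8/strict refuses to encode the lone surrogate.
try:
    write_name("strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# UTF-8/surrogateescape writes the original byte back unchanged.
written = write_name("surrogateescape")
```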
>
> But we should achieve a good balance between correctness and usability in
> the default behavior.
>
> >
> > Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with
> > surrogateescape, or backslashreplace for stderr, or surrogatepass for
> > fsencode/fsdecode on Windows, or strict for Strict UTF-8 Mode"... But
> > the PEP title would be too long, no? :-)
> >
>
> I feel the short name is enough.
>
> >
> >> And opening a binary file without the "b" option is a very common mistake
> >> of new developers.  If the default error handler is surrogateescape, they
> >> lose a chance to notice their bug.
> >
> > When open() is used in text mode to read "binary data", usually the
> > developer would only notice when getting the POSIX locale (ASCII
> > encoding). But the PEP 538 already changed that by using the C.UTF-8
> > locale (and so the UTF-8 encoding, instead of the ASCII encoding).
> >
>
> With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not
> UTF-8/surrogateescape.
>
> For example, this code raises UnicodeDecodeError with PEP 538 if the
> file is a JPEG file.
>
>     with open(fn) as f:
>         f.read()
>
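The point can be checked with a small self-contained sketch. The assumptions are mine, not the original poster's: encoding="utf-8" is passed explicitly to stand in for a UTF-8 locale, the file holds only a made-up JPEG-style header, and the name photo.jpg is hypothetical.

```python
import os
import tempfile

# The first bytes of a typical JPEG/JFIF file; 0xff is never valid in UTF-8.
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

with tempfile.TemporaryDirectory() as tmp:
    fn = os.path.join(tmp, "photo.jpg")  # hypothetical file name
    with open(fn, "wb") as f:
        f.write(jpeg_header)

    # UTF-8 with the default strict handler: the missing "b" is noticed at once.
    try:
        with open(fn, encoding="utf-8") as f:
            f.read()
        strict_raised = False
    except UnicodeDecodeError:
        strict_raised = True

    # UTF-8/surrogateescape: the same mistake passes silently and the program
    # receives a str full of lone surrogates.
    with open(fn, encoding="utf-8", errors="surrogateescape") as f:
        data = f.read()
```

In the second case nothing is lost, since encoding data with surrogateescape gives back the original bytes, but the bug of opening a binary file in text mode goes unreported.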
>
> > I'm not sure that locales are the best way to detect such a class of
> > bugs. I suggest using the -b or -bb option to detect such bugs without
> > having to care about the locale.
> >
>
> But many new developers don't use or know about the -b or -bb options.
>
> >
> >> On the other hand, it helps some use cases when the user wants
> >> byte-transparent behavior, without modifying code to use
> >> "surrogateescape" explicitly.
> >>
> >> Which is the more important scenario?  Does anyone have an opinion about it?
> >> Are there any rationales and use cases I'm missing?
> >
> > Usually users expect that Python 3 "just works" and doesn't bother them
> > with the locale (that nobody understands).
> >
> > The old version of the PEP contains a long list of issues:
> >
> > https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986
> >
> > I already replaced the strict error handler with surrogateescape for
> > sys.stdin and sys.stdout on the POSIX locale in Python 3.5:
> > https://bugs.python.org/issue19977
> >
> > For the rationale, read for example these comments:
> >
> [snip]
>
> OK, I'll read them and think again about open()'s default behavior.
> But I still hope open()'s behavior is consistent between PEP 538 and PEP 540.
>
> Regards,