[Python-ideas] PEP 540: Add a new UTF-8 mode

Thu Jan 12 00:05:39 EST 2017

It seems to me that having a C locale can mean two things:

1) It really is meant to be ASCII

2) It's mis-configured (or un-configured), meaning the system encoding is
unknown.

if (2) then utf-8 is a fine default.

if (1), then there are two options:

1) Everything on the system really is ASCII -- in which case, utf-8 would
"just work" -- no problem.

2) There are non-ascii file names, etc. on this supposedly ASCII system. In
which case, do folks expect their Python programs to find these issues and
raise errors? They may well expect that their Python program will not let
them try to save a non ASCII filename, for instance. But I suspect that
they wouldn't want it to raise an obscure encoding error -- but rather
would want the app to do something friendly.
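
To make the two cases concrete, here is a minimal sketch. The byte strings are made-up examples, not taken from any real system. It assumes file names arrive as raw bytes: pure-ASCII bytes decode identically under ASCII and UTF-8, while a non-ASCII name fails under strict ASCII with the kind of encoding error mentioned above.

```python
# Hypothetical file names as raw bytes, roughly as the OS would hand them over.
ascii_name = b"report.txt"
utf8_name = "r\u00e9sum\u00e9.txt".encode("utf-8")

# Case 1: everything really is ASCII, so decoding as UTF-8 gives the same text.
assert ascii_name.decode("ascii") == ascii_name.decode("utf-8")

# Case 2: a non-ASCII name on a supposedly ASCII system.
try:
    utf8_name.decode("ascii")  # strict ASCII rejects bytes >= 0x80
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding the same bytes as UTF-8 succeeds.
decoded = utf8_name.decode("utf-8")
```

ASCII is a strict subset of UTF-8, which is why case 1 cannot go wrong.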

So I see no downside to using utf-8 when the C locale is defined.

-CHB

On Wed, Jan 11, 2017 at 4:23 PM, INADA Naoki <songofacandy at gmail.com> wrote:

> > My PEP 540 is different than Nick's PEP 538, even for the POSIX
> > locale. I propose to always use the surrogateescape error handler,
> > whereas Nick wants to keep the strict error handler for inputs and
> > outputs.
> > https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler
> >
> > The surrogateescape error handler is useful for writing programs which
> > work as pipes, like the UNIX programs cat, grep, sed, ...:
> > https://www.python.org/dev/peps/pep-0540/#producer-consumer-model-using-pipes
> >
> > You can get the behaviour of Nick's PEP 538 using my UTF-8 Strict
> > mode. Compare "UTF-8 mode" and "UTF-8 Strict mode" lines in the tables
> > of my use case. The UTF-8 mode always works, but can produce mojibake,
> > whereas UTF-8 Strict doesn't produce mojibake but can fail depending
> > on data and the locale.
> >
> > IMHO most users prefer usability ("just work") over correctness
> > (prevent mojibake).
> >
>
> I'm ±0 to surrogateescape by default.  I feel +1 for stdout and -1 for
> stdin.
>
> In the output case, surrogateescape is weaker than strict, but it only allows
> surrogateescaped binary.  If a program carefully uses surrogateescape decoding,
> surrogateescape on stdout is safe enough.
>
> On the other hand, surrogateescape is very weak for input.  It accepts
> arbitrary bytes.
> It should be used carefully.
>
> But I agree that using different error handlers for stdin and stdout is not
> beautiful.
> That's why I'm ±0.
>
>
> FYI, when http://bugs.python.org/issue15216 is merged, we can change the
> error handler easily: ``sys.stdout.set_encoding(errors='surrogateescape')``
>
> So it's controllable from Python.  Some programs which handle filenames may
> prefer surrogateescape, and some programs like CGI scripts may prefer strict
> UTF-8 because JSON and HTML5 shouldn't contain arbitrary bytes.
>
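
For reference, the stdin/stdout asymmetry described in the quoted message can be sketched with plain bytes and str, with no real streams involved. The sample bytes are a made-up example, and this only illustrates the two error handlers, not either PEP's actual implementation.

```python
# Latin-1 encoded bytes that are not valid UTF-8 (made-up example).
raw = b"caf\xe9.txt"

# Input side: surrogateescape accepts arbitrary bytes instead of failing;
# the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Output side: encoding with surrogateescape restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

# The strict handler refuses to encode the lone surrogate.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

This is the pipe-friendly behaviour: bytes pass through unchanged, at the cost of letting undecodable input in.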

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov