[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

Nick Coghlan ncoghlan at gmail.com
Wed Dec 6 00:46:17 EST 2017


Something I've just noticed that needs to be clarified: on Linux, "C"
locale and "POSIX" locale are aliases, but this isn't true in general
(e.g. it's not the case on *BSD systems, including Mac OS X).

To handle that in PEP 538, I made it clear that everything is keyed
specifically off the "C" locale, since that's what you actually get by
default.

So if PEP 540 is going to implicitly trigger switching encodings, it
needs to specify whether it's going to look for the C locale or the
POSIX locale (I'd suggest C locale, since that's the actual default
that causes problems).
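
As a rough illustration of why the exact name matters, the sketch below
queries the active LC_CTYPE locale and compares it against "C" only. The
helper names are hypothetical and come from neither PEP. The assumption
is that a platform treating "POSIX" as an alias (glibc, for example)
reports it as "C", while other platforms may report "POSIX" unchanged.

```python
import locale


def current_ctype_locale():
    """Return the name of the active LC_CTYPE locale without changing it."""
    # Passing None as the second argument only queries the setting.
    return locale.setlocale(locale.LC_CTYPE, None)


def is_plain_c_locale(name):
    """Hypothetical check keyed on "C" alone, not on "POSIX"."""
    return name == "C"


if __name__ == "__main__":
    name = current_ctype_locale()
    print(name, is_plain_c_locale(name))
```

A check written this way deliberately ignores a locale that reports
itself as "POSIX", which is the behaviour suggested above.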

The precedence relationship with locale coercion also needs to be
spelled out: successful locale coercion should skip implicitly
enabling UTF-8 mode (for opt-in UTF-8 mode, we'd still try to coerce
the locale setting as appropriate, so extension modules are more
likely to behave themselves).

On 6 December 2017 at 14:07, INADA Naoki <songofacandy at gmail.com> wrote:
> Oh, revised version is really short!
>
> And I have one worrying point.
> With UTF-8 mode, open()'s default encoding/error handler is
> UTF-8/surrogateescape.
>
> Containers are really growing.  PyCharm supports Docker and many new Python
> developers use Docker instead of installing Python directly on their system,
> especially on Windows.
>
> Opening a binary file without the "b" option is a very common mistake among
> new developers.  If the default error handler is surrogateescape, they lose
> the chance to notice their bug.
>
> On the other hand, it helps some use cases where the user wants
> byte-transparent behavior without modifying code to use "surrogateescape"
> explicitly.
>
> Which scenario is more important?  Does anyone have an opinion about it?
> Are there any rationales or use cases I'm missing?

For platforms that offer a C.UTF-8 locale, I'd like "LC_CTYPE=C.UTF-8
python" and "PYTHONCOERCECLOCALE=0 LC_CTYPE=C PYTHONUTF8=1 python" to be
equivalent (aside from the known limitation that extension modules may
not do the right thing in the latter case).

For the locale coercion case, the default error handler for `open`
remains as "strict", which means I'd be in favour of keeping it as
"strict" by default in UTF-8 mode as well. That would flip the toggle
in the PEP: "strict UTF-8" would be the default selection for
"PYTHONUTF8=1", and you'd choose the more relaxed option via
"PYTHONUTF8=permissive".
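
The difference between the two error handlers can be seen by decoding
bytes that are not valid UTF-8. The sketch below uses the 8-byte PNG
signature as sample data and calls bytes.decode() directly instead of
open(), as a stand-in for reading a binary file in text mode. The
variable names are illustrative only.

```python
# The PNG file signature; 0x89 is not a valid UTF-8 start byte.
data = b"\x89PNG\r\n\x1a\n"

# "strict": the mistake surfaces immediately as an exception.
try:
    data.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "surrogateescape": the undecodable byte is smuggled through as a lone
# surrogate code point, so no error is raised at decode time.
text = data.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
roundtrip = text.encode("utf-8", errors="surrogateescape")
```

With "strict" the new developer gets an error pointing at the missing
"b"; with "surrogateescape" the data round-trips byte for byte, which
is what the byte-transparent use cases want.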

That way, the combination of PEPs 538 and 540 would give us the
following situation in the C locale:

1. Our preferred approach is to coerce LC_CTYPE in the C locale to a
UTF-8 based equivalent
2. Only if that fails (e.g. as it will on CentOS 7) do we resort to
implicitly enabling CPython's internal UTF-8 mode (which should behave
like C.UTF-8, *except* for the fact extension modules won't respect
it)

That way, the ideal outcome is that a UTF-8 based locale exists, and
we use it automatically when needed. UTF-8 mode then lets us cope with
older platforms where neither C.UTF-8 nor an equivalent exists.
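
The precedence described above can be sketched as a small decision
function. This is a simplified model, not CPython's startup code: the
function name, its parameters and the returned dict are hypothetical,
and the set of available locales is passed in explicitly instead of
being probed with setlocale(). The target list follows the locales PEP
538 tries in order.

```python
# Coercion targets, in the order PEP 538 tries them.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def startup_plan(lc_ctype, available_locales, utf8_opt_in=False):
    """Model which mechanism would apply for a given starting locale."""
    plan = {"lc_ctype": lc_ctype, "utf8_mode": utf8_opt_in}

    # Everything is keyed off the "C" locale specifically, not "POSIX".
    if lc_ctype != "C":
        return plan

    # Preferred: coerce LC_CTYPE to a UTF-8 based equivalent. Opt-in
    # UTF-8 mode stays enabled alongside a successful coercion.
    for target in COERCION_TARGETS:
        if target in available_locales:
            plan["lc_ctype"] = target
            return plan

    # Fallback: no suitable locale exists, so enable UTF-8 mode implicitly.
    plan["utf8_mode"] = True
    return plan
```

For example, startup_plan("C", {"C.UTF-8"}) selects coercion and leaves
UTF-8 mode off, while startup_plan("C", set()) falls back to UTF-8 mode.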

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

