[Python-Dev] PEP 538 (review round 2): Coercing the legacy C locale to a UTF-8 based locale

Martin (gzlist) gzlist at googlemail.com
Sun Jun 11 20:05:23 EDT 2017


On 09/05/2017, Nick Coghlan <ncoghlan at gmail.com> wrote:
>
> Enough changes have accumulated in PEP 538 since the start of the
> previous thread that it seems sensible to me to start a new thread
> specifically covering the current design (which aims to address all
> the concerns raised in the previous thread).
>
> I haven't requoted the PEP in full since it's so long, but will
> instead refer readers to the web version:
> https://www.python.org/dev/peps/pep-0538/

I did try to follow along via the mailing list threads, and have now
read over the PEP again. Responding now as I'm actually touching code
relevant to this again.

Broadly the proposal looks good to me. It does help one of the two
cases I care about, and does no serious harm.

For a command line Python script, making sure Python itself uses UTF-8
for the C locale is sufficient, and setting LC_CTYPE so spawned
processes that aren't Python have a chance at doing the right thing
too is a reasonable upgrade. This is probably good enough to drop one
hack[1] rather than porting it to Python 3.

For hosted Python code this does nothing (apart from print to stderr),
so mod_wsgi for instance is still going to need the same kind of dance
to get users to set LANG as configuration themselves[2]. Ideally this
PEP would have a C API or something so I could file bugs to make it
just do the right thing.

A few notes on specifics; I'll try not to stray too much into choices
already made.

The PEP doesn't persuade me that Py_Initialize actually is too late to
switch *specifically* from ASCII to UTF-8. Any preceding operations on
text would have used ASCII, which is a safe subset of UTF-8. There
might be issues with other internals, or surrogateescape, or it's just
a pain?

I don't like the side effect of changing the standard stream error
handler to surrogateescape if LANG=C.UTF-8 is actually set. Obviously
bad data vs exception is a trade-off anyway, but it means that to get
a Python script that will always either output valid data or exit, you
have to set an arbitrary language like en_US. Yes, that's true of the
change as implemented in 3.5 anyway.
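
To make the trade-off concrete, here is a minimal sketch of the two
error handlers applied to the same text. The sample bytes and the
encode_for_stream helper are made up for illustration; this is not
how the interpreter configures the standard streams.

```python
# Bytes that are not valid UTF-8 (0xE9 is "e acute" in Latin-1),
# for example a filename left over from a legacy locale.
raw = b"caf\xe9"

# surrogateescape carries the undecodable byte through as a lone surrogate.
text = raw.decode("utf-8", errors="surrogateescape")


def encode_for_stream(value, errors):
    """Hypothetical helper: encode as a UTF-8 stream with this handler would."""
    return value.encode("utf-8", errors=errors)


# With surrogateescape the original, invalid bytes are written back out.
round_tripped = encode_for_stream(text, "surrogateescape")

# With strict the same text raises rather than emitting invalid UTF-8.
try:
    encode_for_stream(text, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

So surrogateescape quietly reproduces the bad bytes on output, while
strict stops the script, which is the "valid data or exit" behaviour
I'd want to be able to ask for.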

Not setting LANG and only setting LC_CTYPE seems fine. Obviously,
things can go wrong based on odd behaviours of spawned processes, but
it works for the normal idioms.
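
As a rough sketch of why touching only LC_CTYPE is enough, assuming
the usual POSIX lookup order (LC_ALL, then the category variable, then
LANG) and a C.UTF-8 target: the helper names below are hypothetical
and this is a simplification, not the PEP's actual implementation.

```python
def effective_ctype(env):
    """Return the locale name that would apply to LC_CTYPE for this env."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty are treated the same
            return value
    return "C"  # assumed default when nothing is configured


def coerce_ctype(env, target="C.UTF-8"):
    """Hypothetical helper: copy env, setting LC_CTYPE only when needed."""
    coerced = dict(env)
    if coerced.get("LC_ALL"):
        # LC_ALL overrides every category, so setting LC_CTYPE would not help.
        return coerced
    if effective_ctype(coerced) in ("C", "POSIX"):
        coerced["LC_CTYPE"] = target  # LANG is deliberately left untouched
    return coerced
```

The resulting mapping could be passed as the env argument to
subprocess.run, so a spawned process sees the UTF-8 LC_CTYPE while
LANG, and with it the other categories, stays as the user set it.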

I'm not sold on adding the PYTHONCOERCECLOCALE runtime configuration.
All it really does is turn off stderr kipple if you must use the C
locale for other reasons? Anyone with the ability to set that variable
could just set LANG instead. I was going to suggest just documenting
LC_ALL=C as the override instead of adding a Python-specific variable,
but I note, looking around, that Debian discourages that[3].

That's all, though I will also grumble a bit about how long the PEP is.

Martin


[1] Override Py_FileSystemDefaultEncoding to utf-8 from ascii for the bzr script
<https://code.launchpad.net/~gz/bzr/filesystem_default_encoding_794353/+merge/85170>
[2] WSGIDaemonProcess lang and locale options
<https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html>
[3] "Using LC_ALL is strongly discouraged as it overrides everything"
<https://wiki.debian.org/Locale#Configuration>

