Force UTF-8 option regardless of locale

On Tue, Aug 30, 2016 at 8:14 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
Some people love tiny Linux images for Docker and Raspberry Pi. These images don't have any locale other than C. Some ops people love LANG=C or LC_ALL=C because it avoids trouble and unexpected performance regressions caused by the locale (e.g. the sort command is much slower on ja_JP.utf8).

I want to write scripts that use UTF-8 for stdio and fsencoding. Sometimes people run my script in the C locale. And sometimes they run it in a misconfigured locale, because SSH sends a LANG that the system doesn't have.

So I wonder whether Python could have a "Force UTF-8" option, ideally as a configure option or a site-wide installation option, because:

* a command line option cannot be set in a shebang
* setting an environment variable may be forgotten when writing things like crontab entries.

The option may also make startup a bit faster, because Python can skip setting the locale at startup.

Any thoughts? How should the option be set?

-- INADA Naoki <songofacandy@gmail.com>
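For readers following along, here is a minimal sketch of how a script can report which encodings the interpreter derived from the environment, which is a quick way to see the C-locale problem described above. The function name `report_encodings` is made up for this illustration; the calls themselves are from the standard library.

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python derived from the current environment."""
    return {
        # Encoding used for text written to stdout (None if stdout has none).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used to convert file names between str and bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Locale-derived default that open() falls back to without encoding=.
        "preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```

Running it once with a UTF-8 locale and once with LC_ALL=C (and with stdout piped) should show where the values differ on a given system.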

On 30 August 2016 at 10:05, INADA Naoki <songofacandy@gmail.com> wrote:
We run into this for CentOS images as well - the Docker images currently still default to C, as they don't have C.UTF-8 available (although you can set LANG=en_US.UTF-8 in your Dockerfile). I think Fedora has started defaulting to C.UTF-8 now, but I haven't actually checked recently.

Broad availability of C.UTF-8 will hopefully help mitigate that behaviour, but there's still a long transition ahead on that front, as it seems unlikely "LANG=C" will ever be redefined to mean "LANG=C.UTF-8" (which would leave folks having to explicitly request "LANG=C.ASCII" to get the old US-centric behaviour) :(
While I agree this is a good way to go, we unfortunately don't have a lot of precedent to work with here :(

The closest we've had to date to a "CPython runtime configuration file" is the implementation dependent cert verification config file in PEP 493: https://www.python.org/dev/peps/pep-0493/#backporting-pep-476-to-earlier-pyt...

Since that was designed specifically as a migration tool for the RHEL system Python, it glosses over a lot of things we'd need to care about for a proper config file, like:

- how it works when running from a local checkout
- how (or if) to support parallel installations
- how (or if) to support virtual environments
- how (or if) to support per-user overrides
- how (or if) to support environment variable overrides
- how (or if) to support command line overrides
- how to support Windows
- whether we're defining this as a CPython-only thing, or whether we'd expect other implementations to support it as well

However, a config file was desirable in the cert verification case for the same reasons you mention here: so it can be visible system wide, without requiring changes to environment variables or command invocations.

We do have a per-venv config file (pyvenv.cfg), but that's currently an implementation detail of the 'venv' module, rather than a clearly defined standard format.

Cheers,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Le 30 août 2016 02:05, "INADA Naoki" <songofacandy@gmail.com> a écrit :
How should the option be set?
I propose to add a new -X utf8 option. If the use case is important enough, we might also add a PYTHONUTF8 environment variable.

The problem is that I'm not sure an env var is the right way to configure Python in such an environment. But an env var shouldn't hurt, and it is common to add a new env var together with a new command line option: I added PYTHONFAULTHANDLER=1 / -X faulthandler for faulthandler and PYTHONTRACEMALLOC=N / -X tracemalloc=N for tracemalloc.

Victor
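A sketch of how the proposed switch would be used. It assumes an interpreter in which `-X utf8` is actually implemented (CPython gained it in 3.7 through PEP 540, after this thread); older interpreters do not implement it. The helper name `child_stdout_encoding` is made up for this illustration.

```python
import os
import subprocess
import sys


def child_stdout_encoding(extra_args=()):
    """Run a child interpreter under LC_ALL=C and return its stdout encoding."""
    env = dict(os.environ)
    env["LC_ALL"] = "C"  # simulate a minimal image that only has the C locale
    # Drop variables that would override the behaviour being demonstrated.
    env.pop("PYTHONIOENCODING", None)
    env.pop("PYTHONUTF8", None)
    cmd = [sys.executable, *extra_args, "-c", "import sys; print(sys.stdout.encoding)"]
    # stdout is a pipe here, which is the situation discussed in the thread.
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("ascii").strip()


if __name__ == "__main__":
    print("default :", child_stdout_encoding())
    print("-X utf8 :", child_stdout_encoding(["-X", "utf8"]))
```

The first line of output depends on the interpreter version and platform; the second is the case the option is meant to pin down.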

On 30.08.2016 10:29, Victor Stinner wrote:
In PyRun we simply define a default for PYTHONIOENCODING and set this to utf-8: http://www.egenix.com/products/python/PyRun/doc/#_Toc452660008

The encoding guessing is still available by setting the env var to "" (but this is hardly used). So far this has been working great.

-- Marc-Andre Lemburg, eGenix.com
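A rough sketch of the same idea at the launcher level. This is not PyRun's actual implementation, only an illustration of "define a default, let the caller override it": the wrapper sets PYTHONIOENCODING to utf-8 only when the caller has not set it. The name `run_with_utf8_stdio` is hypothetical.

```python
import os
import subprocess
import sys


def run_with_utf8_stdio(code):
    """Run `code` in a child interpreter with UTF-8 stdio as the default."""
    env = dict(os.environ)
    # setdefault keeps any value the caller already exported.
    env.setdefault("PYTHONIOENCODING", "utf-8")
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,  # piped stdout is the case that usually breaks
        check=True,
    )
    return result.stdout  # raw bytes as written by the child


if __name__ == "__main__":
    raw = run_with_utf8_stdio("print('\\u00e9')")
    print(raw)
```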

On Wed, Aug 31, 2016 at 4:45 AM, M.-A. Lemburg <mal@egenix.com> wrote:
My concern is people other than me running Python scripts on such systems (which have only the C locale). Most Unix commands run well in the C locale; only Python scripts get into a lot of trouble:

* A locale error when just running a Python script (with a bad LANG setting).
* A Unicode error when stdout is piped, while the script runs fine without a pipe (when LANG=C and PYTHONIOENCODING is not set).
* open() without an explicit `encoding='utf-8'` works on Mac and in LANG=*.utf8 environments, but raises UnicodeError in a LANG=C environment.

(Actually, my company and I don't use UTF-8 filenames, so we don't have trouble with fsencoding. But some other companies may.)

On such systems, a site-wide configuration to override `nl_langinfo(CODESET)` may help people. Otherwise, they:

1. Face a locale error when running a Python script, and write LANG=C in their .bashrc.
2. Face a UnicodeError when piping from a Python script, and write PYTHONIOENCODING=utf-8 in their .bashrc.
3. Face a UnicodeError when reading or writing a text file, and add an explicit `encoding='utf-8'`. (This bug may go unnoticed in a CI environment that has a *.UTF-8 locale, and then show up in production.)
4. Finally, feel that Python is a troublesome language and don't want to use it anymore.

I know about the `/etc/environment` file. But ops people don't like adding lines to it only for Python. They feel "Perl (or Ruby) is better than Python".

This is why I think a configuration option or site-wide configuration is desirable even if we have PYTHON(IO|FS|PREFERRED)ENCODINGS environment variables.
-- INADA Naoki <songofacandy@gmail.com>
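For comparison, the workarounds in steps 2 and 3 above can also be applied inside the script instead of in .bashrc. This is only a minimal sketch of that per-script approach (which is exactly the burden the message argues people should not have to carry); the helper names `utf8_writer`, `write_text` and `read_text` are hypothetical.

```python
import io
import os
import sys
import tempfile


def utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is always encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


def write_text(path, text):
    # Explicit encoding: the result does not depend on LANG / LC_ALL.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        write_text(path, "caf\u00e9\n")
        out = utf8_writer(sys.stdout.buffer)
        out.write(read_text(path))
        out.flush()
        out.detach()  # leave sys.stdout.buffer open for the rest of the program
```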
