[New-bugs-announce] [issue34914] Clarify text encoding used to enable UTF-8 mode

Nick Coghlan report at bugs.python.org
Sat Oct 6 04:06:57 EDT 2018


New submission from Nick Coghlan <ncoghlan at gmail.com>:

While working on the docs updates for bpo-34589 (clarifying that "PYTHONCOERCECLOCALE=0" and "PYTHONCOERCECLOCALE=warn" need both the environment variable name and the value to be encoded as ASCII in order to have any effect), I realised that it was less explicit how to reliably enable UTF-8 mode. UTF-8 mode can be enabled even when the current locale is a nominally ASCII-incompatible one like gb18030, and the command line settings get processed as wchar_t strings rather than 8-bit char strings.
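For reference, a minimal sketch of the two ways of enabling UTF-8 mode discussed here: the "-X utf8" command line option and the "PYTHONUTF8=1" environment variable. It launches a child interpreter and reads back sys.flags.utf8_mode. The helper names are hypothetical and only serve the illustration; it assumes Python 3.7 or later.

```python
import os
import subprocess
import sys

# One-liner run in the child interpreter: report whether UTF-8 mode is on.
CHECK = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_via_option():
    """Start a child interpreter with -X utf8 and return what it reports."""
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", CHECK],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def utf8_mode_via_environment():
    """Start a child interpreter with PYTHONUTF8=1 set in its environment."""
    env = dict(os.environ, PYTHONUTF8="1")
    result = subprocess.run(
        [sys.executable, "-c", CHECK],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("-X utf8      ->", utf8_mode_via_option())
    print("PYTHONUTF8=1 ->", utf8_mode_via_environment())
```

Both the option name and the variable name and value are plain ASCII text, which is what the rest of this message relies on.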

From what I've been able to figure out, the environment variable case is the same as for locale coercion: both the environment variable name and the value need to be encoded as ASCII. This actually happens implicitly, as even encodings like gb18030 still encode ASCII letters and numbers the same way ASCII does - their incompatibilities with ASCII lie elsewhere. Fully incompatible encodings like UTF-16 and UTF-32 don't get used as locale encodings in the first place because they'd break too many applications.
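The claim about ASCII letters and numbers can be checked from Python itself. This is a rough sketch using Python's own codecs as a stand-in for the C library's locale encodings (an assumption; the two are not guaranteed to agree on every byte), and the helper name is made up for the example.

```python
def encodes_like_ascii(text, codec):
    """Return True if `codec` produces the same bytes for `text` as ASCII."""
    return text.encode(codec) == text.encode("ascii")


# The settings mentioned in this issue: only ASCII letters, digits, "_" and "=".
SETTINGS = ["PYTHONUTF8=1", "PYTHONCOERCECLOCALE=0", "PYTHONCOERCECLOCALE=warn"]

if __name__ == "__main__":
    # Codecs for encodings that do get used as locale encodings.
    for codec in ("gb18030", "big5", "euc_jp", "shift_jis"):
        for setting in SETTINGS:
            assert encodes_like_ascii(setting, codec), (codec, setting)

    # UTF-16 and UTF-32 are the fully incompatible case: every character
    # takes more than one byte, so the byte strings differ from ASCII.
    for codec in ("utf-16-le", "utf-32-le"):
        assert not encodes_like_ascii("PYTHONUTF8=1", codec)

    print("all checks passed")
```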

I believe the same holds true for the command line arguments, just in the other direction: they get converted to wchar_t* with either mbstowcs or mbrtowc, and then compared using wcscmp or wcsncmp, but for all encodings that actually get used as locale encodings, the ASCII code points that CPython cares about get mapped directly to the corresponding UTF-16-LE or UTF-32 code point at both compile time (in the code) and at runtime (when reading the arg string).
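The decoding direction can be sketched the same way. Here bytes.decode() stands in for the mbstowcs/mbrtowc conversion (an approximation, not what CPython actually calls at startup), and the function name is hypothetical.

```python
def decoded_code_points(raw, codec):
    """Decode `raw` with `codec` and return the resulting code points."""
    return [ord(ch) for ch in raw.decode(codec)]


if __name__ == "__main__":
    # Bytes as they would arrive in argv for "-X utf8".
    args = [b"-X", b"utf8"]

    for codec in ("gb18030", "big5", "euc_jp", "shift_jis", "utf-8"):
        for raw in args:
            # For ASCII-only input, each byte value is also the code point,
            # so a comparison against an ASCII literal in the source succeeds.
            assert decoded_code_points(raw, codec) == list(raw), (codec, raw)

    print("ASCII arguments decode to the same code points in every codec tried")
```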

Given that simply not thinking about the problem will actually do the right thing in all cases, I don't think this needs to be documented prominently, but I do think it would be good to explicitly address the point somewhere.

----------
assignee: docs at python
components: Documentation
messages: 327236
nosy: docs at python, eric.snow, ncoghlan, vstinner
priority: low
severity: normal
stage: needs patch
status: open
title: Clarify text encoding used to enable UTF-8 mode
type: enhancement
versions: Python 3.7, Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue34914>
_______________________________________

